Improved Cooperation by Exploiting a Common Signal
Learning Temporal Conventions for Sustainable Appropriation of Common-Pool Resources
Panayiotis Danassis
École Polytechnique Fédérale de Lausanne (EPFL), Artificial Intelligence Laboratory, Lausanne, Switzerland
panayiotis.danassis@epfl.ch

Zeki Doruk Erden
École Polytechnique Fédérale de Lausanne (EPFL), Artificial Intelligence Laboratory, Lausanne, Switzerland
zeki.erden@epfl.ch

Boi Faltings
École Polytechnique Fédérale de Lausanne (EPFL), Artificial Intelligence Laboratory, Lausanne, Switzerland
boi.faltings@epfl.ch
ABSTRACT
Can artificial agents benefit from human conventions? Human societies manage to successfully self-organize and resolve the tragedy of the commons in common-pool resources, in spite of the bleak prediction of non-cooperative game theory. On top of that, real-world problems are inherently large-scale and of low observability. One key concept that facilitates human coordination in such settings is the use of conventions. Inspired by human behavior, we investigate the learning dynamics and emergence of temporal conventions, focusing on common-pool resources. Extra emphasis was given to designing a realistic evaluation setting: (a) environment dynamics are modeled on real-world fisheries, (b) we assume decentralized learning, where agents can observe only their own history, and (c) we run large-scale simulations (up to 64 agents). Uncoupled policies and low observability make cooperation hard to achieve; as the number of agents grows, the probability of taking a correct gradient direction decreases exponentially. By introducing an arbitrary common signal (e.g., date, time, or any periodic set of numbers) as a means to couple the learning process, we show that temporal conventions can emerge and agents reach sustainable harvesting strategies. The introduction of the signal consistently improves the social welfare (by 258% on average, up to 3306%), the range of environmental parameters where sustainability can be achieved (by 46% on average, up to 300%), and the convergence speed in low abundance settings (by 13% on average, up to 53%).
KEYWORDS
Multi-agent Deep Reinforcement Learning; Coordination; Resource Allocation; Sustainability; Social Conventions; Social Dilemmas
ACM Reference Format:
Panayiotis Danassis, Zeki Doruk Erden, and Boi Faltings. 2021. Improved Cooperation by Exploiting a Common Signal: Learning Temporal Conventions for Sustainable Appropriation of Common-Pool Resources. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 22 pages.
The question of cooperation in socio-ecological systems and sustainability in the use of common-pool resources constitutes a critical open problem. Classical non-cooperative game theory suggests that rational individuals will exhaust a common resource, rather than sustain it for the benefit of the group, resulting in the 'tragedy of the commons' [17]. The tragedy of the commons arises when it is challenging and/or costly to exclude individuals from appropriating common-pool resources (CPR) of finite yield [37]. Individuals face strong incentives to appropriate, which results in overuse and even permanent depletion of the resource. Examples include the degradation of fresh water resources, the over-harvesting of timber, the depletion of grazing pastures, the destruction of fisheries, etc.

In spite of the bleak prediction of non-cooperative game theory, the tragedy of the commons is not inevitable, though the conditions under which cooperation and sustainability can be achieved may be more demanding the higher the stakes. Nevertheless, humans have been systematically shown to successfully self-organize and resolve the tragedy of the commons in CPR appropriation problems, even without the imposition of an extrinsic incentive structure [36]. E.g., by enabling the capacity to communicate, individuals have been shown to maintain the harvest at an optimal level [6, 37]. Communication, though, creates overhead and might not always be possible [44]. One of the key findings of empirical field research on sustainable CPR regimes around the world is the employment of boundary rules, which prescribe who is authorized to appropriate from a resource [36]. Such boundary rules can be of a temporal nature, prescribing the temporal order in which people harvest from a common-pool resource (e.g., 'protocol of play' [3]). The aforementioned rules can be enforced by an authority, or emerge in a self-organized manner (e.g., by utilizing environmental signals such as the time, date, season, etc.) in the form of a social convention.

Many real-world CPR problems are inherently large-scale and partially observable, which further increases the challenge of sustainability. In this work we deal with the most information-restrictive setting: each participant is modeled as an individual agent with its own policy, conditioned only on local information, specifically its own history of action/reward pairs (a fully decentralized method). Global observations, including the resource stock, the number of participants, and the joint observations and actions, are hidden, as is the case in many real-world applications, like commercial fisheries. Under such a setting, it is impossible to avoid positive probability mass on undesirable actions (i.e., simultaneous appropriation), since there is no correlation between the agents' policies. This leads to either low social welfare, because the agents are being conservative, or, even worse, the depletion of the resource. Depletion becomes more likely as the problem size grows, due to the non-stationarity of the environment and the global exploration problem.

We propose a simple technique: allow agents to observe an arbitrary, common signal from the environment. Observing a common signal mitigates the aforementioned problems because it allows for coupling between the learned policies, increasing the joint policy space.
Agents, for example, can now learn to harvest in turns, with varying efforts per signal value, or to allow for fallow periods. The benefit is twofold: the agents learn not only to avoid depletion, but also to maintain a healthy stock, which allows for a large harvest and, thus, higher social welfare. It is important to stress that we do not assume any a priori relation between the signal space and the problem at hand. Moreover, we require no communication, no extrinsic incentive mechanism, and we do not change the underlying architecture or learning algorithm. We simply utilize a means – common environmental signals that are amply available to the agents [18] – to accommodate correlation between policies. This in turn enables the emergence of ordering conventions of a temporal nature (henceforth referred to as temporal conventions) and sustainable harvesting strategies.

Our contributions are the following:
(1) We are the first to introduce a realistic common-pool resource appropriation game for multi-agent coordination, based on bio-economic models of commercial fisheries, and provide theoretical analysis on the dynamics of the environment.
(2) We propose a simple and novel technique: allow agents to observe an arbitrary periodic environmental signal. Such signals are amply available in the environment (e.g., time, date, etc.) and can foster cooperation among agents.
(3) We provide a thorough (quantitative & qualitative) analysis of the learned policies and demonstrate significant improvements on sustainability, social welfare, and convergence speed.
As autonomous agents proliferate, they will be called upon to interact in ever more complex environments. This will bring forth the need for techniques that enable the emergence of sustainable cooperation. Despite the growing interest in and success of multi-agent deep reinforcement learning (MADRL), scaling to environments with a large number of learning agents continues to be a problem [15]. A multi-agent setting is inherently susceptible to many pitfalls: non-stationarity (the moving-target problem), the curse of dimensionality, credit assignment, global exploration, and relative overgeneralization [20, 33, 48]. Recent advances in the field of MADRL deal with only a limited number of agents. It has been shown that as the number of agents increases, the probability of taking a correct gradient direction decreases exponentially [20], thus the proposed methods cannot be easily generalized to complex scenarios with many agents. Some of these adversities can be mitigated by the centralized training, decentralized execution paradigm. Yet, centralized methods likewise suffer from a plethora of other problems: they are computationally heavy, they assume unlimited communication (which is impractical in many real-world applications), the exact same team has to be deployed (in the real world we cooperate with strangers), and, most importantly, the size of the joint action space grows exponentially with the number of agents.
Our approach aims to mitigate the aforementioned problems of MADRL by introducing coupling between the learned policies. It is important to note that the proposed approach does not change the underlying architecture of the network (the capacity of the network stays the same), nor the learning algorithm or the reward structure. We simply augment the input space by allowing the observation of an arbitrary common signal. The signal has no a priori relation to the problem, i.e., we do not need to design an additional feature; in fact we use a periodic sequence of arbitrary integers. It is still possible for the original network (without the signal) to learn a sustainable strategy. Nevertheless, we show that the simple act of augmenting the input space drastically increases the social welfare, the speed of convergence, and the range of environmental parameters in which sustainability can be achieved. Most importantly, the proposed approach requires no communication, creates no additional overhead, is simple to implement, and is scalable.

The proposed technique was inspired by temporal conventions in resource allocation games of non-cooperative game theory. The closest analogue is the courtesy convention of [11], where rational agents learn to coordinate their actions to access a set of indivisible resources by observing a signal from the environment. Closely related is the concept of the correlated equilibrium (CE) [1, 35], which, from a practical perspective, constitutes perhaps the most relevant non-cooperative solution concept [18]. Most importantly, it is possible to achieve a correlated equilibrium without a central authority, simply by utilizing meaningless environmental signals [2, 7, 11]. Such common environmental signals are amply available to the agents [18]; humans routinely and robustly cooperate in their everyday lives, at large scale and under dynamic and unpredictable demand, and they too have access to auxiliary information that helps correlate their actions (e.g., time, date, etc.). Correlated equilibria also relate to boundary rules and temporal conventions in human societies; the most prominent example of a CE in real life is the traffic light, which can also be viewed as a temporal convention for the use of the road. The aforementioned line of research studies pre-determined strategies of rational agents. Instead, we study the emergent behaviors of a group of independent learning agents aiming to maximize the long-term discounted reward.

A second source of inspiration is behavioral conventions, one of the key concepts that facilitate human coordination. A convention is defined as a customary, expected, and self-enforcing behavioral pattern [28, 49]. It can be considered as a behavioral rule, designed and agreed upon ahead of time [43, 46], or it may emerge from within the system itself [34, 46]. The temporal conventions examined in this work fall in the latter category.

Moving on to the application domain, there has been great interest recently in CPR problems (and, more generally, social dilemmas [24]) as an application domain for MADRL [21, 23, 25, 26, 32, 38–40, 47]. CPR problems offer complex environment dynamics and relate to real-world socio-ecological systems. There are a few distinct differences between the CPR models presented in the aforementioned works and the model introduced in this paper. First and foremost, we designed our model to resemble reality as closely as possible, using bio-economic models of commercial fisheries [8, 12], resulting in complex environment dynamics. Second, we have a continuous action space, which further complicates the learning process. Finally, we opted not to learn from visual input (raw pixels). The problem of direct policy approximation from visual input does not add complexity to the social dilemma itself; it only adds complexity in the feature extraction of the state. It requires large networks because of the additional complexity of extracting features from pixels, while only a small part of what is learned is the actual policy [10]. Most importantly, it makes it harder to study the policy in isolation, as we do in this work. Moreover, from a practical perspective, learning from a visual input would be meaningless, given that we are dealing with a low-observability scenario where the resource stock and the number and actions of the participants are hidden.

In terms of the methodology for dealing with the tragedy of the commons, the majority of the aforementioned literature falls broadly into two categories: reward shaping [21, 23, 40], which refers to adding a term to the extrinsic reward an agent receives from the environment, and opponent shaping [25, 32, 38], which refers to manipulating the opponent (e.g., by sharing rewards, punishments, or adapting your own actions). Contrary to that, we only allow agents to observe an existing environmental signal. We do not modify the intrinsic or extrinsic rewards, design new features, or require a communication network. Finally, boundary rules emerged in [38] as well, in the form of spatial territories. Such territories can increase inequality, while we maintain high levels of fairness.
We consider a decentralized multi-agent reinforcement learning scenario in a partially observable general-sum Markov game [42]. At each time-step, agents take actions based on a partial observation of the state space, and receive an individual reward. Each agent learns a policy independently. More formally, let $\mathcal{N} = \{1, \dots, N\}$ denote the set of agents, and $\mathcal{M}$ be an $N$-player, partially observable Markov game defined on a set of states $\mathcal{S}$. An observation function $O^n : \mathcal{S} \rightarrow \mathbb{R}^d$ specifies agent $n$'s $d$-dimensional view of the state space. Let $\mathcal{A}^n$ denote the set of actions for agent $n \in \mathcal{N}$, and $\boldsymbol{a} = \times_{\forall n \in \mathcal{N}} a^n$, where $a^n \in \mathcal{A}^n$, the joint action. The states change according to a transition function $\mathcal{T} : \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \rightarrow \Delta(\mathcal{S})$, where $\Delta(\mathcal{S})$ denotes the set of discrete probability distributions over $\mathcal{S}$. Every agent $n$ receives an individual reward based on the current state $\sigma_t \in \mathcal{S}$ and joint action $\boldsymbol{a}_t$, given by the reward function $r^n : \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \rightarrow \mathbb{R}$. Finally, each agent learns a policy $\pi^n : O^n \rightarrow \Delta(\mathcal{A}^n)$ independently, through its own experience of the environment (observations and rewards). Let $\boldsymbol{\pi} = \times_{\forall n \in \mathcal{N}} \pi^n$ denote the joint policy. The goal for each agent is to maximize the long-term discounted payoff, given by $V^n_{\boldsymbol{\pi}}(\sigma_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r^n(\sigma_t, \boldsymbol{a}_t) \mid \boldsymbol{a}_t \sim \boldsymbol{\pi}_t,\ \sigma_{t+1} \sim \mathcal{T}(\sigma_t, \boldsymbol{a}_t)\right]$, where $\gamma$ is the discount factor and $\sigma_0$ is the initial state.

In order to better understand the impact of self-interested appropriation, it is beneficial to examine the dynamics of real-world common-pool renewable resources. To that end, we present an abstracted bio-economic model for commercial fisheries [8, 12]. The model describes the dynamics of the stock of a common-pool renewable resource as a group of appropriators harvest over time. The harvest depends on (i) the effort exerted by the agents and (ii) the ease of harvesting a resource at that point in time, which depends on the stock level. The stock replenishes over time, with a rate dependent on the current stock level.

More formally, let $\mathcal{N}$ denote the set of appropriators, $\epsilon_{n,t} \in [0, \mathcal{E}_{max}]$ the effort exerted by agent $n$ at time-step $t$, and $E_t = \sum_{n \in \mathcal{N}} \epsilon_{n,t}$ the total effort at time-step $t$. The total harvest is given by Eq. 1, where $s_t \in [0, \infty)$ denotes the stock level (i.e., amount of resources) at time-step $t$, $q(\cdot)$ denotes the catchability coefficient (Eq. 2), and $S_{eq}$ is the equilibrium stock of the resource.

$$H(E_t, s_t) = \begin{cases} q(s_t) E_t, & \text{if } q(s_t) E_t \leq s_t \\ s_t, & \text{otherwise} \end{cases} \qquad (1)$$

$$q(x) = \begin{cases} \frac{x}{S_{eq}}, & \text{if } x \leq S_{eq} \\ 1, & \text{otherwise} \end{cases} \qquad (2)$$

Each environment can only sustain a finite amount of stock. If left unharvested, the stock will stabilize at $S_{eq}$. Note also that $q(\cdot)$, and therefore $H(\cdot)$, are proportional to the current stock, i.e., the higher the stock, the larger the harvest for the same total effort. The stock dynamics are governed by Eq. 3, where $F(\cdot)$ is the spawner-recruit function (Eq. 4), which governs the natural growth of the resource, and $r$ is the growth rate. To avoid highly skewed growth models and unstable environments ('behavioral sink' [4, 5]), $r \in \left[-W_0\!\left(-\tfrac{1}{2e}\right),\, -W_{-1}\!\left(-\tfrac{1}{2e}\right)\right]$, where $W_k(\cdot)$ is the Lambert $W$ function (see Section C for details).

$$s_{t+1} = F(s_t - H(E_t, s_t)) \qquad (3)$$

$$F(x) = x\, e^{r\left(1 - \frac{x}{S_{eq}}\right)} \qquad (4)$$

We assume that the individual harvest is proportional to the exerted effort (Eq. 5), and the revenue of each appropriator is given by Eq. 6, where $p_t$ is the price ($ per unit of resource) and $c_t$ is the cost ($) of harvesting (e.g., operational cost, taxes, etc.). Here lies the 'tragedy': the benefits from harvesting are private ($p_t h_{n,t}(\epsilon_{n,t}, s_t)$), but the loss is borne by all (in terms of a reduced stock, see Eq. 3).

$$h_{n,t}(\epsilon_{n,t}, s_t) = \frac{\epsilon_{n,t}}{E_t} H(E_t, s_t) \qquad (5)$$

$$u_{n,t}(\epsilon_{n,t}, s_t) = p_t\, h_{n,t}(\epsilon_{n,t}, s_t) - c_t \qquad (6)$$

Optimal Harvesting. The question that naturally arises is: what is the 'optimal' effort in order to harvest a yield that maximizes the revenue (Eq. 6)? We make two assumptions. First, we assume that the entire resource is owned by a single entity (e.g., a firm or the government), which possesses complete knowledge of and control over the resource. Thus, we only have a single control variable, $E_t$. This does not change the underlying problem, since the total harvested resources are linear in the proportion of efforts put by individual agents (Eq. 5). Second, we consider the case of zero discounting, i.e., future revenues are weighted equally with current ones. Of course, firms (and individuals) do discount the future, and bio-economic models should take that into account, but this complicates the analysis and is out of the scope of this work. We argue we can still draw useful insight into the problem.

Our control problem consists of finding a piecewise continuous control $E_t$, so as to maximize the total revenue ($\max_{E_t} \sum_{t=0}^{T} U_t(E_t, s_t)$). The maximization problem can be solved using Optimal Control Theory [13, 27]. We have proven the following theorem:

Theorem 2.1. The optimal control variable $E^*_t$ that solves the maximization problem $\max_{E_t} \sum_{t=0}^{T} U_t(E_t, s_t)$, given the model dynamics described in Section 2.2, is given by Eq. 7, where $\lambda_t$ are the adjoint variables of the Hamiltonians:

$$E^*_{t+1} = \begin{cases} E_{max}, & \text{if } (p_{t+1} - \lambda_{t+1})\, q(F(s_t - H(E_t, s_t))) \geq 0 \\ 0, & \text{if } (p_{t+1} - \lambda_{t+1})\, q(F(s_t - H(E_t, s_t))) < 0 \end{cases} \qquad (7)$$

Proof. (Sketch) We formulate the Hamiltonians [13, 27], which turn out to be linear in the control variables $E_{t+1}$, with coefficients $(p_{t+1} - \lambda_{t+1})\, q(F(s_t - H(E_t, s_t)))$. Thus, the optimal sequence of $E_{t+1}$ that maximizes the Hamiltonians is given according to the sign of those coefficients. See Section B for the complete proof. □

The optimal strategy is a bang-bang controller, which switches based on the adjoint variable values, stock level, and price. The values for $\lambda_t$ do not have a closed-form expression (because of the discontinuity of the control), but can be found iteratively for a given set of environment parameters ($r$, $S_{eq}$) and the adjoint equations [13, 27]. However, the discontinuity in the control input makes solving the adjoint equations quite cumbersome. We can utilize iterative forward/backward methods as in [13], but this is out of the scope of this paper.

There are a few interesting key points. First, to compute the optimal level of effort we require observability of the resource stock, which is not always a realistic assumption (in fact, in this work we do not make this assumption). Second, we require complete knowledge of the strategies of the other appropriators. Third, even if both of the aforementioned conditions are met, the bang-bang controller of Eq. 7 does not have a constant transition limit; the limit changes at each time-step, determined by the adjoint variable $\lambda_{t+1}$, thus finding the switch times remains quite challenging.
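To make the dynamics of Section 2.2 concrete, the following minimal Python sketch simulates the model of Eqs. 1–6 under a fixed joint effort profile. All numerical values below (the stock, $S_{eq}$, $r$, and the efforts) are illustrative placeholders, not the settings used in the experiments.

```python
import numpy as np

def total_harvest(total_effort, stock, s_eq):
    # Catchability coefficient q(s) = s / S_eq, capped at 1 (Eq. 2);
    # the harvest can never exceed the available stock (Eq. 1).
    q = min(stock / s_eq, 1.0)
    return min(q * total_effort, stock)

def step(stock, efforts, s_eq, r, price=1.0, cost=0.0):
    """One transition of the fishery model (Eqs. 1-6): returns the next stock and per-agent revenues."""
    total_effort = sum(efforts)
    h_total = total_harvest(total_effort, stock, s_eq)
    # Individual harvests are proportional to individual efforts (Eq. 5).
    shares = [(e / total_effort) * h_total if total_effort > 0 else 0.0 for e in efforts]
    revenues = [price * h - cost for h in shares]                      # Eq. 6
    escapement = stock - h_total                                       # stock left after harvesting
    next_stock = escapement * np.exp(r * (1.0 - escapement / s_eq))    # Ricker growth (Eqs. 3-4)
    return next_stock, revenues

# Illustrative run (placeholder parameters): 4 agents exerting a constant effort.
stock, s_eq, r = 10.0, 10.0, 1.0
efforts = [0.5, 0.5, 0.5, 0.5]
for t in range(100):
    stock, revenues = step(stock, efforts, s_eq, r)
    if stock < 1e-6:
        print(f"stock depleted at t = {t}")
        break
```

Running such a simulation for different effort profiles illustrates the tension captured by Eq. 6: larger joint efforts increase immediate revenue but shrink the escapement, and thus future catchability.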
Harvesting at Maximum Effort. To gain a deeper understanding of the dynamics of the environment, we will now consider a baseline strategy where every agent harvests with the maximum effort at every time-step, i.e., $\epsilon_{n,t} = \mathcal{E}_{max}, \forall n \in \mathcal{N}, \forall t$. This corresponds to the Nash equilibrium of a stage game (myopic agents). For a constant growth rate $r$ and a given number of agents $N$, we can identify two interesting stock equilibrium points ($S_{eq}$): the 'limit of sustainable harvesting' and the 'limit of immediate depletion'.

The limit of sustainable harvesting ($S^{N,r}_{LSH}$) is the stock equilibrium point where the goal of sustainable harvesting becomes trivial: for any $S_{eq} > S^{N,r}_{LSH}$, the resource will not get depleted, even if all agents harvest at maximum effort. Note that the coordination problem remains far from trivial even for $S_{eq} > S^{N,r}_{LSH}$, especially for increasing population sizes. Exerting maximum effort in environments with $S_{eq}$ close to $S^{N,r}_{LSH}$ will yield low returns because the stock remains low, resulting in a small catchability coefficient. In fact, this can be seen in Fig. 1, which depicts the social welfare (SW), i.e., the sum of utilities, against increasing $S_{eq}$ values ($\mathcal{E}_{max} = 1$, $r = 1$, varying population sizes $N$), where the $S^{N,r}_{LSH}$ values are also marked. Thus, the challenge is not only to keep the strategy sustainable, but to keep the resource stock high, so that the returns can be high as well. (Slight deviations from the predicted theoretical values of Eq. 8 are due to the finite episode length and the non-zero depletion threshold.)

Figure 1: Social welfare (SW) – normalized by the maximum SW obtained in each setting – against increasing $S_{eq}$ values, for varying population sizes $N$, with $\mathcal{E}_{max} = 1$ and $r = 1$. The $x$-axis is in logarithmic scale.

On the other end of the spectrum, the limit of immediate depletion ($S^{N,r}_{LID}$) is the stock equilibrium point where the resource is depleted in one time-step (under maximum harvest effort by all the agents). The problem does not become impossible for $S_{eq} \leq S^{N,r}_{LID}$; yet, exploration can have catastrophic effects (amplifying the problem of global exploration in MARL). The following two theorems prove the formulas for $S^{N,r}_{LSH}$ and $S^{N,r}_{LID}$.
Theorem 2.2. The limit of sustainable harvesting $S^{N,r}_{LSH}$ for a continuous resource governed by the dynamics of Section 2.2, assuming that all appropriators harvest with the maximum effort $\mathcal{E}_{max}$, is:

$$S^{N,r}_{LSH} = \frac{e^r N \mathcal{E}_{max}}{e^r - 1} \qquad (8)$$

Proof. Note that for $S_{eq} > S^{N,r}_{LSH}$, $q(s_t) E_t < s_t, \forall t$, otherwise the resource would be depleted. Moreover, if $s_0 = S_{eq}$ – which is a natural assumption, since prior to any intervention the stock will have stabilized on the fixed point – then $s_t < S_{eq}, \forall t$. Thus, we can re-write Eq. 1 and 2 as:

$$H(E_t, s_t) = \frac{s_t}{S_{eq}} E_t = \frac{s_t N \mathcal{E}_{max}}{S_{eq}}$$

Let $\alpha \triangleq \frac{N \mathcal{E}_{max}}{S_{eq}}$, and $\beta = 1 - \alpha$. The state transition becomes:

$$s_{t+1} = F(s_t - \alpha s_t) = \beta s_t\, e^{r\left(1 - \frac{\beta s_t}{S_{eq}}\right)}$$

We write it as a difference equation:

$$\Delta_t(s_t) \triangleq s_{t+1} - s_t = \left(\beta e^{r\left(1 - \frac{\beta s_t}{S_{eq}}\right)} - 1\right) s_t$$

At the limit of sustainable harvesting, as the stock diminishes to $s_t = \delta \rightarrow 0^+$, to remain sustainable it must be that $\Delta_t(s_t) > 0$ (in practice, $\delta$ is enforced by the granularity of the resource):

$$\lim_{s_t \rightarrow 0^+} \mathrm{sgn}\left(\Delta_t(s_t)\right) > 0 \;\Rightarrow\; \beta e^r - 1 > 0 \;\Rightarrow\; S_{eq} > \frac{e^r N \mathcal{E}_{max}}{e^r - 1} \qquad \square$$
Theorem 2.3. The limit of immediate depletion $S^{N,r}_{LID}$ for a continuous resource governed by the dynamics of Section 2.2, assuming that all appropriators harvest with the maximum effort $\mathcal{E}_{max}$, and given that $r \in \left[-W_0\!\left(-\tfrac{1}{2e}\right),\, -W_{-1}\!\left(-\tfrac{1}{2e}\right)\right]$, is given by:

$$S^{N,r}_{LID} = N \mathcal{E}_{max} \qquad (9)$$

Proof. The resource is depleted in a single step if:

$$H(E_t, s_t) = s_t \;\Rightarrow\; q(s_t) E_t \geq s_t \;\Rightarrow\; \frac{s_t}{S_{eq}} E_t \geq s_t \;\Rightarrow\; S_{eq} \leq N \mathcal{E}_{max} \qquad \square$$
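As an informal sanity check on Theorems 2.2 and 2.3, one can simulate the max-effort baseline under the dynamics of Section 2.2 (as reconstructed above) and observe where depletion stops occurring. The sketch below is illustrative only; the horizon, threshold, and the specific test values of $S_{eq}$ are placeholder choices.

```python
import numpy as np

def depletes_under_max_effort(s_eq, n_agents, e_max=1.0, r=1.0,
                              horizon=500, threshold=1e-6):
    """Simulate all agents harvesting at maximum effort; return True if the stock collapses."""
    stock = s_eq                      # natural equilibrium before any intervention
    total_effort = n_agents * e_max   # E_t = N * E_max at every time-step
    for _ in range(horizon):
        q = min(stock / s_eq, 1.0)                   # catchability, Eq. 2
        harvest = min(q * total_effort, stock)       # total harvest, Eq. 1
        escapement = stock - harvest
        stock = escapement * np.exp(r * (1.0 - escapement / s_eq))  # Eqs. 3-4
        if stock < threshold:
            return True
    return False

n, e_max, r = 16, 1.0, 1.0
s_lsh = np.e**r * n * e_max / (np.e**r - 1)   # Theorem 2.2
s_lid = n * e_max                             # Theorem 2.3
print(depletes_under_max_effort(0.9 * s_lsh, n))   # expected: True  (below the limit)
print(depletes_under_max_effort(1.1 * s_lsh, n))   # expected: False (above the limit)
print(depletes_under_max_effort(s_lid, n))         # depleted within a single step
```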
We introduce an auxiliary signal: side information from the environment (e.g., time, date, etc.) that agents can potentially use in order to facilitate coordination and reach more sustainable strategies. Real-world examples include shepherds that graze on particular days of the week, or fishermen that fish in particular months. In our case, the signal can be thought of as a mechanism to increase the set of possible (individual and joint) policies. Such signals are amply available to the agents [11, 18]. We do not assume any a priori relation between the signal and the problem at hand. In fact, in this paper we use a set of arbitrary integers that repeat periodically. We use $\mathcal{G} = \{1, \dots, G\}$ to denote the set of signal values.

Environment Settings. Let $p_t = 1$, and keep $c_t$ constant, $\forall t$. We set the growth rate at $r = 1$, the initial population at $s_0 = S_{eq}$, and the maximum effort at $\mathcal{E}_{max} = 1$. The findings of Section 2.2.2 provide a guide on the selection of the $S_{eq}$ values. Specifically, we simulated environments with $S_{eq}$ given by Eq. 10, where $K = S^{N,r}_{LSH} / N = \frac{e^r \mathcal{E}_{max}}{e^r - 1}$ is a constant and $M_s \in \mathbb{R}^+$ is a multiplier that adjusts the scarcity (difficulty); $M_s = 1$ corresponds to $S_{eq} = S^{N,r}_{LSH}$.

$$S_{eq} = M_s K N \qquad (10)$$

Agent Architecture. Each agent uses a two-layer (64 neurons each) neural network for the policy approximation. The input (observation $o^n = O^n(S)$) is a tuple $\langle \epsilon_{n,t-1},\ u_{n,t-1}(\epsilon_{n,t-1}, s_{t-1}),\ g_t \rangle$, consisting of the individual effort exerted and the reward obtained in the previous time-step, and the current signal value. The output is a continuous action value $a_t = \epsilon_{n,t} \in [0, \mathcal{E}_{max}]$, specifying the current effort level. The policies are trained using the Proximal Policy Optimization (PPO) algorithm [41]. PPO was chosen because it avoids large policy updates, ensuring smoother training and avoiding catastrophic failures. The reward received from the environment corresponds to the revenue, i.e., $r^n(\sigma_t, \boldsymbol{a}_t) = u_{n,t}(\epsilon_{n,t}, s_t)$, and the discount factor $\gamma$ was held fixed (see Section D for the hyper-parameter values).

Signal Implementation. The signal is represented as a $G$-dimensional one-hot encoded vector, where the high bit is shifted periodically. The initial value was chosen at random at the beginning of each episode, to avoid bias towards particular values. Throughout this paper, the term no signal will be used interchangeably with a unit signal size $G = 1$, since a signal of size 1 in one-hot encoding is just a constant input that yields no information. We evaluated signals of varying cardinality (see Section 3.6).
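A minimal sketch of how such an observation could be assembled at each time-step is given below. The helper names and the way the periodic signal is generated are illustrative assumptions; the full agent architecture and hyper-parameters (PPO via RLlib) are described in Section D.

```python
import numpy as np

def one_hot(value, cardinality):
    """One-hot encode a signal value g_t in {0, ..., G-1}."""
    vec = np.zeros(cardinality, dtype=np.float32)
    vec[value] = 1.0
    return vec

def signal_value(t, start, cardinality):
    """Arbitrary periodic signal: the high bit shifts by one position per time-step."""
    return (start + t) % cardinality

def observation(prev_effort, prev_reward, t, start, cardinality):
    """Observation tuple <effort_{t-1}, reward_{t-1}, g_t>, with g_t one-hot encoded."""
    sig = one_hot(signal_value(t, start, cardinality), cardinality)
    return np.concatenate(([prev_effort, prev_reward], sig))

# Example: G = 4, random initial signal value, first observation of an episode.
G = 4
start = np.random.randint(G)   # random initial value, as described above
obs = observation(prev_effort=0.0, prev_reward=0.0, t=0, start=start, cardinality=G)
print(obs.shape)               # (2 + G,)
```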
Termination Condition. An episode terminates when either (a) the resource stock falls below a small threshold $\delta$, or (b) a fixed number of time-steps $T_{max} = 500$ is reached. We trained our agents for a maximum of 5000 episodes, with the possibility of early stopping if both of the following conditions are satisfied: (i) a minimum of 95% of the maximum episode duration (i.e., 475 time-steps) is reached for 200 episodes in a row, and (ii) the average total reward obtained by agents in each episode of the aforementioned 200 episodes does not change by more than 5%. In case of early stopping, the metric values for the remainder of the episodes are extrapolated as the average of the last 200 episodes, in order to properly average across trials.
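The early-stopping rule can be expressed compactly; the sketch below is one possible reading of the two conditions (in particular, how "does not change by more than 5%" is measured over the window is an assumption of this sketch).

```python
def should_stop_early(durations, rewards, window=200,
                      min_duration=475, reward_tolerance=0.05):
    """Check the two early-stopping conditions on the trailing window of episodes."""
    if len(durations) < window:
        return False
    recent_durations = durations[-window:]
    recent_rewards = rewards[-window:]
    # (i) every episode in the window reached at least 95% of the maximum duration
    long_enough = all(d >= min_duration for d in recent_durations)
    # (ii) the average total reward within the window varies by no more than 5%
    mean_reward = sum(recent_rewards) / window
    if mean_reward == 0:
        stable = True
    else:
        stable = all(abs(r - mean_reward) <= reward_tolerance * abs(mean_reward)
                     for r in recent_rewards)
    return long_enough and stable
```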
Measuring the Influence of the Signal. It is important to have a quantitative measure of the influence of the introduced signal. As such, we adapted the Causal Influence of Communication (CIC) metric [31], initially designed to measure positive listening in emergent inter-agent communication. The CIC is calculated using the mutual information between the signal and the agent's action. Please see Section D.3 for a complete description.
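As a rough illustration of such a measure (the exact CIC computation is given in Section D.3; the equal-width discretization of the continuous actions below is an assumption of this sketch), the mutual information between the observed signal value and an agent's action can be estimated from an episode's trajectory:

```python
import numpy as np

def signal_action_mutual_information(signals, efforts, n_bins=10):
    """Estimate I(signal; action) from paired samples of an episode.

    signals: integer signal values in {0, ..., G-1}; efforts: continuous actions in [0, 1],
    discretized here into n_bins equal-width bins.
    """
    signals = np.asarray(signals)
    efforts = np.asarray(efforts)
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]   # interior bin edges
    bins = np.digitize(efforts, edges)                 # bin index in {0, ..., n_bins-1}
    joint = np.zeros((signals.max() + 1, n_bins))
    for g, b in zip(signals, bins):
        joint[g, b] += 1.0
    joint /= joint.sum()
    p_g = joint.sum(axis=1, keepdims=True)             # marginal over signal values
    p_a = joint.sum(axis=0, keepdims=True)             # marginal over action bins
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (p_g * p_a), 1.0)
    return float(np.sum(joint * np.log(ratio)))
```

Normalizing such values per population, as done for Fig. 5, then allows comparing trends across scarcity levels.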
Reproducibility, Reporting of Results, Limitations. Reproducibility is a major challenge in (MA)DRL due to different sources of stochasticity, e.g., hyper-parameters, model architecture, implementation details, etc. [14, 19, 20]. To minimize those sources, the implementation was done using RLlib (https://docs.ray.io/en/latest/rllib.html), an open-source library for MADRL [29]. We refer the reader to Section D for a description of the architecture and hyper-parameters. All simulations were repeated 8 times, and the reported results are the average values of the last 10 episodes over those trials (excluding Fig. 7, which depicts a representative trial). (MA)DRL also lacks common practices for statistical testing [19, 20]. In this work, we opted to use the Student's t-test [45] due to its robustness [9]. Nearly all of the reported results have p-values below the chosen significance level.

We present the results from a systematic evaluation of the proposed approach on a wide variety of environmental settings – with $M_s$ ranging from well below the limit of immediate depletion, $M^{LID}_s \approx 0.63$, to above the limit of sustainable harvesting, $M^{LSH}_s = 1$ – and population sizes (up to 64 agents). Due to lack of space we only present the most relevant results; see Section E for a complete report (e.g., results tables, fairness, small population sizes, etc.). In the majority of the results, we study the influence of a signal of cardinality $G = N$, compared to no signal ($G = 1$).

Sustainability. We declare a strategy 'sustainable' iff the agents reach the maximum episode duration (500 steps), i.e., they do not deplete the resource. Fig. 2 depicts the achieved episode length – with and without the presence of a signal ($G = N$) – for environments of decreasing difficulty (increasing $S_{eq} \propto M_s$).

Figure 2: Episode length, with and without the signal ($G = N$), for environments of decreasing difficulty (increasing equilibrium stock multiplier $M_s$); panels (a)–(d) correspond to the populations $N$ = 8, 16, 32, and 64.

The introduction of the signal significantly increases the range of environments ($M_s$) where sustainability can be achieved. Assuming that $M_s \in [0, 1]$ – since for $M_s \geq 1$ the resource cannot be depleted even under maximum effort by all agents – the signal increases the range of $M_s$ values for which a sustainable strategy is reached by 46% on average (up to 300%). Moreover, as the number of agents increases ($N$ =
32 and 64), depletion is avoided at non-trivial $M_s$ values only with the introduction of the signal. Finally, note that the $M_s$ value where a sustainable strategy is found increases with $N$, which demonstrates that the difficulty of the problem increases superlinearly with $N$ (given that $S_{eq} \propto M_s N$).

Social Welfare. Reaching a sustainable strategy – i.e., avoiding resource depletion – is only one piece of the puzzle: an agent's revenue depends on the harvest (Eq. 1), which in turn depends on the catchability coefficient (Eq. 2). Thus, in order to achieve a high social welfare (sum of utilities, i.e., $\sum_{n \in \mathcal{N}} r^n(\cdot)$), the agents need to learn policies that balance the trade-off between maintaining a high stock (which ensures a high catchability coefficient) and yielding a large harvest (which results in a higher reward). This problem becomes even more apparent as resources become more abundant, i.e., for $M_s = 1 \pm x$, close to the limit of sustainable harvesting (below or, especially, above it; see Section 2.2.2). In these settings it is easy to find a sustainable strategy: a myopic best-response strategy (harvesting at maximum effort) by all agents will not deplete the resource. Yet, it will result in low social welfare (SW).

Fig. 3 depicts the relative difference in SW with and without the signal ($(SW_{G=N} - SW_{G=1})/SW_{G=1}$, where $SW_{G=X}$ denotes the SW achieved using a signal of cardinality $X$), for environments of decreasing difficulty (increasing $S_{eq} \propto M_s$) and varying population size. To improve readability, changes greater than 100% are shown with numbers on top of the bars. Given the various sources of stochasticity, we opted to omit settings in which agents were not able to reach an episode duration of more than 10 time-steps (either with or without the signal).

Figure 3: Relative difference in social welfare (SW) when a signal of cardinality $G = N$ is introduced ($(SW_{G=N} - SW_{G=1})/SW_{G=1}$), for environments of decreasing difficulty (increasing $S_{eq} \propto M_s$) and varying population size. Changes greater than 100% are shown with numbers on top of the bars.

Figure 4: Relative difference in convergence time with the introduction of a signal ($(CT_{G=N} - CT_{G=1})/CT_{G=1}$, where $CT_{G=X}$ denotes the time until convergence when using a signal of cardinality $X$), for environments of decreasing difficulty (increasing $S_{eq} \propto M_s$) and varying population size.

The presence of the signal results in a significant improvement in SW. Specifically, we have an average of 258% improvement across all the depicted settings in Fig. 3, while the maximum improvement is 3306%. (The averaging is performed across the entire depicted range of $M_s$, including the really scarce environments where there is no sustainable strategy with or without the signal and, thus, the change is zero.) These improvements stem from (i) achieving more sustainable strategies and (ii) improved cooperation. The former results in higher rewards due to longer episodes, in settings where the strategies without the signal deplete the resource. The latter allows agents to avoid over-harvesting – which results in a higher catchability coefficient – in settings where both strategies (with or without the signal) are sustainable. The contribution of the signal is much more pronounced under scarcity: the difference in achieved SW decreases as $M_s$ increases, eventually becoming less than 10% for large enough $M_s$ (with the crossover point depending on the population size). This suggests that the proposed approach is of high practical value in environments where resources are scarce (like most real-world applications), a claim that we further corroborate in Sections 3.5 and 3.7.

Figure 5: Average (over agents and trials) CIC values (normalized) vs. the equilibrium stock multiplier $M_s$, for population/signal size $N = G \in \{8, 16, 32, 64\}$.

The second major influence of the introduction of the proposed signal – besides the sustainability and efficiency of the learned strategies – is on the convergence time. Let the system be considered converged when the global state does not change significantly. As a practical way to pinpoint the time of convergence, we used the 'Termination Criterion' of Section 3.1.4. Fig. 4 depicts the relative difference in convergence time with the introduction of a signal ($(CT_{G=N} - CT_{G=1})/CT_{G=1}$), for environments of decreasing difficulty (increasing $S_{eq} \propto M_s$) and varying population size. We have omitted the settings in which agents were not able to reach an episode duration of more than 10 time-steps (either with or without the signal).

There is a disjoint effect of the signal on the convergence speed. Up to the limit of sustainable harvesting ($M_s \leq 1$), the signal speeds up convergence by 13% on average across all the depicted settings (including the ones with no improvement), and by up to 53%. This is vital, as the majority of real-world problems involve managing scarce resources. On the other hand, for $M_s > 1$, i.e., settings with abundant resources, the system converges faster without the signal (14% slower with the signal on average, across all the depicted settings). One possible explanation is that as resources become more abundant, it is harder (impossible for $M_s > 1$) for agents to deplete them. Therefore the learning is more efficient – and the convergence faster – since the episodes tend to last longer (without needing the signal). Moreover, an abundance of resources decouples the effects of the agents' actions on each other, reducing the variance and, again, easing the learning process without the signal.
The results presented so far provide a qualitative measure of the influence of the introduced signal, through the improvement on sustainability, social welfare, and convergence speed. They also indicate a decrease in the influence of the signal as resources become abundant. The question that naturally arises is: how much do agents actually take the signal into account in their policies? To answer this question, Fig. 5 depicts the CIC values – a quantitative measure of the influence of the introduced signal (see Section 3.1.5) – versus increasing values of $M_s$ (i.e., increasing $S_{eq} \propto M_s$, or more abundant resources), for population/signal size $N = G \in \{8, 16, 32, 64\}$. The values are averaged across the 8 trials and the agents, and are normalized with respect to the maximum value for each population. (Due to this normalization, Fig. 5 shows trends across $M_s$ values – not between population sizes; for the absolute values please refer to Table 12.) Higher CIC values indicate a higher causal influence of the signal.

Figure 6: Achieved social welfare (Fig. 6a) and convergence time (Fig. 6b) for different signal cardinalities $G$, with $N = 32$, for varying resource levels $M_s$.

CIC is low for the trials in which a sustainable strategy could not be found (the most scarce $M_s$ settings for each of $N$ = 8, 16, 32, and 64; see Fig. 2). In the cases where a sustainable strategy was reached, the CIC decreases as $M_s$ increases. The harder the coordination problem, the more the agents rely on the environmental signal.
Up until now we have evaluated the presence (or lack thereof) of an environmental signal of cardinality equal to the population size ($G = N$). This requires exact knowledge of $N$, thus it is interesting to test the robustness of the proposed approach under varying signal sizes. As a representative test-case, we evaluated five different signal cardinalities – both smaller and larger than the population size, including $G = 23$ and $G = 41$, i.e., values that are not multiples of $N$ – for $N = 32$ and moderate scarcity of the resource (three $M_s$ values). Fig. 6 depicts the achieved social welfare and convergence time under the aforementioned settings.

Starting with Fig. 6a, we can see that the SW increases with the signal cardinality. Specifically, we have 263%, 255%, 341%, 416%, and 474% improvement on average across the three $M_s$ values for the five cardinalities, respectively. We hypothesize that the improvement stems from the increased joint strategy space that a larger signal size allows. A signal size larger than $N$ can also allow the emergence of 'rest' (fallow) periods – signal values where the majority of agents harvest at really low efforts. This would allow the resource to recuperate and increase the SW through a higher catchability coefficient; see Section 3.7 / Fig. 7b for an example. Regarding the convergence speed (Fig. 6b), we have 22%, 38%, 36%, 41%, and 36% improvement on average (across $M_s$ values).

These results showcase that the introduction of the signal itself – regardless of its cardinality or, more generally, its temporal representative power – provides a clear benefit to the agents in terms of SW and convergence speed. This greatly improves the real-world applicability of the proposed technique, as knowledge of the exact population size is not required; instead, the agents can opt to select any signal available in their environment. (Recall that the signal is represented as a one-hot vector; Fig. 6 thus also shows that a network with 32 signal inputs can work for a range of population sizes or, equivalently, that agents can use networks whose number of signal inputs differs from $N$.) Moreover, the signal cardinality can also be considered a design choice, depending on the requirements and limitations of the system.

Qualitative Analysis. We have seen that the introduction of an arbitrary signal facilitates cooperation and sustainable harvesting. But do temporal conventions actually emerge? Fig. 7a presents an example of the evolution of the agents' strategies for each signal value, for a population of $N = 4$, signal size $G = N = 4$, and a fixed equilibrium stock multiplier $M_s$. Each row represents an agent ($n_i$), while each column represents a signal value ($g_j$). Each line represents the average effort the agent exerts on that specific signal value – calculated by averaging the actions of the agent in each corresponding signal value across the episode.

We can see a clear temporal convention emerging: at the first signal value, only agents $n_1$ and $n_3$ harvest (first and third row), and at each of the remaining signal values, likewise, only a pair of agents harvests. Contrary to that, in a sustainable joint strategy without the use of the signal, every agent harvests at every time-step with an average (across all agents) effort of approximately 40% (for the same setting of $N = 4$ and the same $M_s$). Having all agents harvesting at every time-step makes coordination increasingly harder as we increase the population size, mainly due to the non-stationarity of the environment (high variance) and the global exploration problem.
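The per-signal-value curves in Fig. 7 are obtained by averaging each agent's actions over the time-steps sharing the same signal value; a minimal sketch of that computation, on a hypothetical trajectory, is given below.

```python
import numpy as np

def average_effort_per_signal(signals, efforts, n_signal_values):
    """Average an agent's exerted effort separately for each signal value g_j within an episode.

    signals: signal value observed at each time-step (integers in {0, ..., G-1})
    efforts: effort the agent exerted at each time-step
    """
    signals = np.asarray(signals)
    efforts = np.asarray(efforts)
    return np.array([efforts[signals == g].mean() if np.any(signals == g) else 0.0
                     for g in range(n_signal_values)])

# Hypothetical episode of 8 steps with G = 4: this agent is active only on signal values 0 and 2.
signals = [0, 1, 2, 3, 0, 1, 2, 3]
efforts = [0.9, 0.0, 0.8, 0.1, 1.0, 0.0, 0.9, 0.0]
print(average_effort_per_signal(signals, efforts, 4))  # -> approx [0.95, 0.0, 0.85, 0.05]
```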
Access Rate. In order to facilitate a systematic analysis of the accessing patterns, we discretized the agents into three bins, according to their exerted effort $\epsilon$: a low-effort bin ('idle'), an intermediate bin ('moderate'), and a high-effort bin ('active'). Then we counted the average number of agents in each bin at the first equilibrium stock multiplier ($M_s$) where a non-depleting strategy was achieved in each setting. Without a signal, either the majority of the agents are 'moderate' harvesters (84% in the smaller populations), or all of them are 'active' harvesters (100% for $N$ = 32 and 64). With the signal, we have a clear separation into 'idle' and 'active' agents. It is apparent that with the signal the agents learn a temporal convention: only a minority is 'active' per time-step, which allows them to maintain a healthy stock and reach sustainable strategies of high social welfare.
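A sketch of this binning is given below; the effort thresholds used here are purely illustrative placeholders, since the exact bin boundaries are a design choice.

```python
import numpy as np

def access_rate_bins(efforts, idle_max=0.2, active_min=0.8):
    """Count how many agents fall in the 'idle', 'moderate', and 'active' effort bins.

    efforts: per-agent efforts at a given time-step; the thresholds are illustrative only.
    """
    efforts = np.asarray(efforts)
    idle = int(np.sum(efforts < idle_max))
    active = int(np.sum(efforts >= active_min))
    moderate = len(efforts) - idle - active
    return {"idle": idle, "moderate": moderate, "active": active}

print(access_rate_bins([0.05, 0.9, 0.1, 0.95]))  # e.g. {'idle': 2, 'moderate': 0, 'active': 2}
```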
Fallowing. A more interesting joint strategy can be seen in Fig. 7b ($N = 3$, $G = N = 3$, same $M_s$). We can see that the agents harvest alternatingly in the first two signal values and rest on the third (a fallow period), potentially to allow resources to replenish and consequently obtain higher rewards in the future, due to a higher catchability coefficient. This also resembles the optimal (bang-bang) harvesting strategy of Theorem 2.1. (This small-population setting was run with the same growth rate in both cases, with and without the signal; see Section E for more information.)

Figure 7: Evolution of the agents' strategies for each signal value, smoothed over 50 episodes. Fig. 7a pertains to a population of $N = 4$ and signal size $G = N = 4$, while Fig. 7b to a population of $N = 3$ and signal size $G = N = 3$. In both cases the same equilibrium stock multiplier $M_s$ is used. Each row represents an agent ($n_i$), while each column a signal value ($g_j$). Each line depicts the average effort the agent exerts on that specific signal value – calculated by averaging the actions of the agent in each corresponding signal value across the episode. Shaded areas represent one standard deviation.

The challenge to cooperatively solve 'the tragedy of the commons' remains as relevant now as when it was first introduced by Hardin in 1968. Sustainable development and the avoidance of catastrophic scenarios in socio-ecological systems – like the permanent depletion of resources, or the extinction of endangered species – constitute critical open problems. To add to the challenge, real-world problems are inherently large in scale and of low observability. This amplifies traditional problems in multi-agent learning, such as global exploration and the moving-target problem. Earlier work in common-pool resource appropriation utilized intrinsic or extrinsic incentives (e.g., reward or opponent shaping). Yet, such techniques need to be designed for the problem at hand and/or require communication or observability of states/actions, which is not always feasible (e.g., in commercial fisheries, the stock or harvesting efforts cannot be directly observed). Humans, on the other hand, show a remarkable ability to self-organize and resolve common-pool resource dilemmas, often without any extrinsic incentive mechanism or communication. Social conventions and the use of auxiliary environmental information constitute key mechanisms for the emergence of cooperation under low observability. In this paper, we demonstrate that utilizing such environmental signals – which are amply available – is a simple, yet powerful and robust technique to foster cooperation in large-scale, low-observability, and high-stakes environments. We are the first to tackle a realistic CPR appropriation scenario modeled on real-world commercial fisheries and under low observability. Our approach avoids permanent depletion in a wider (up to 300%) range of settings, while achieving higher social welfare (up to 3306%) and convergence speed (up to 53%).
REFERENCES
[1] Robert J. Aumann. 1974. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics 1, 1 (1974), 67–96. https://doi.org/10.1016/0304-4068(74)90037-8
[2] Holly P Borowski, Jason R Marden, and Jeff S Shamma. 2014. Learning efficient correlated equilibria. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on. IEEE, 6836–6841.
[3] David V. Budescu, Wing Tung Au, and Xiao-Ping Chen. 1997. Effects of Protocol of Play and Social Orientation on Behavior in Sequential Resource Dilemmas. Organizational Behavior and Human Decision Processes 69, 3 (1997), 179–193. https://doi.org/10.1006/obhd.1997.2684
[4] John B Calhoun. 1962. Population density and social pathology. Scientific American.
[5] Proceedings of the Royal Society of Medicine 66, 1P2 (1973), 80–88. https://doi.org/10.1177/00359157730661P202
[6] Marco Casari and Charles R Plott. 2003. Decentralized management of common property resources: experiments with a centuries-old institution. Journal of Economic Behavior & Organization 51, 2 (2003), 217–247.
[7] Ludek Cigler and Boi Faltings. 2013. Decentralized anti-coordination through multi-agent learning. Journal of Artificial Intelligence Research 47 (2013), 441–473.
[8] Colin W Clark. 2006. The worldwide crisis in fisheries: economic models and human behavior. Cambridge University Press.
[9] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. 2019. A Hitchhiker's Guide to Statistical Comparisons of Reinforcement Learning Algorithms. arXiv preprint arXiv:1904.06979 (2019).
[10] Giuseppe Cuccu, Julian Togelius, and Philippe Cudré-Mauroux. 2019. Playing Atari with Six Neurons. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 998–1006.
[11] Panayiotis Danassis and Boi Faltings. 2019. Courtesy As a Means to Coordinate. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 665–673. http://dl.acm.org/citation.cfm?id=3306127.3331754
[12] Florian K Diekert. 2012. The tragedy of the commons from a game-theoretic perspective. Sustainability 4, 8 (2012), 1776–1786.
[13] Wandi Ding and Suzanne Lenhart. 2010. Introduction to Optimal Control for Discrete Time Models with an Application to Disease Modeling. In Modeling Paradigms and Analysis of Disease Transmission Models. 109–120.
[14] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. 2020. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. In International Conference on Learning Representations. https://openreview.net/forum?id=r1etN1rtPB
[15] Sriram Ganapathi Subramanian, Pascal Poupart, Matthew E. Taylor, and Nidhi Hegde. 2020. Multi Type Mean Field Reinforcement Learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems (Auckland, New Zealand) (AAMAS '20). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 411–419.
[16] Corrado Gini. 1912. Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini, T). Rome: Libreria Eredi Virgilio Veschi (1912).
[17] Garrett Hardin. 1968. The tragedy of the commons. Science.
[18] Econometrica.
[19]
[20] Autonomous Agents and Multi-Agent Systems 33, 6 (2019), 750–797.
[21] Edward Hughes, Joel Z. Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore Graepel. 2018. Inequity Aversion Improves Cooperation in Intertemporal Social Dilemmas. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 3330–3340.
[22] Raj Jain, Dah-Ming Chiu, and W. Hawe. 1998. A Quantitative Measure Of Fairness And Discrimination For Resource Allocation In Shared Computer Systems. CoRR cs.NI/9809099 (1998). http://arxiv.org/abs/cs.NI/9809099
[23] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. 2019. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, Long Beach, California, USA, 3040–3049.
[24] Peter Kollock. 1998. Social dilemmas: The anatomy of cooperation. Annual Review of Sociology 24, 1 (1998), 183–214.
[25] Raphael Koster, Dylan Hadfield-Menell, Gillian K. Hadfield, and Joel Z. Leibo. 2020. Silly Rules Improve the Capacity of Agents to Learn Stable Enforcement and Compliance Behaviors. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems (Auckland, New Zealand) (AAMAS '20). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1887–1888.
[26] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 464–473.
[27] Suzanne Lenhart and John T Workman. 2007. Optimal control applied to biological models. CRC Press.
[28] David Lewis. 2008. Convention: A philosophical study. John Wiley & Sons.
[29] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. 2017. Ray RLlib: A Composable and Scalable Reinforcement Learning Library. In Deep Reinforcement Learning Symposium (DeepRL @ NeurIPS).
[30] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[31] Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the Pitfalls of Measuring Emergent Communication. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 693–701.
[32] Andrei Lupu and Doina Precup. 2020. Gifting in Multi-Agent Reinforcement Learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems (Auckland, New Zealand) (AAMAS '20). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 789–797.
[33] Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2012. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27, 1 (2012), 1–31. https://doi.org/10.1017/S0269888912000057
[34] Mihail Mihaylov, Karl Tuyls, and Ann Nowé. 2014. A decentralized approach for convention emergence in multi-agent systems. Autonomous Agents and Multi-Agent Systems 28, 5 (2014), 749–778.
[35] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. 2007. Algorithmic game theory. Vol. 1. Cambridge University Press, Cambridge.
[36] Elinor Ostrom. 1999. Coping with tragedies of the commons. Annual Review of Political Science 2, 1 (1999), 493–535.
[37] Elinor Ostrom, Roy Gardner, James Walker, and Jimmy Walker. 1994. Rules, games, and common-pool resources. University of Michigan Press.
[38] Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems. 3643–3652.
[39] Alexander Peysakhovich and Adam Lerer. 2018. Consequentialist conditional cooperation in social dilemmas with imperfect information. In International Conference on Learning Representations. https://openreview.net/forum?id=BkabRiQpb
[40] Alexander Peysakhovich and Adam Lerer. 2018. Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (Stockholm, Sweden) (AAMAS '18). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2043–2044.
[41] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
[42] L. S. Shapley. 1953. Stochastic Games. Proceedings of the National Academy of Sciences 39, 10 (1953), 1095–1100. https://doi.org/10.1073/pnas.39.10.1095
[43] Yoav Shoham and Moshe Tennenholtz. 1995. On social laws for artificial agent societies: off-line design. Artificial Intelligence 73, 1 (1995), 231–252. https://doi.org/10.1016/0004-3702(94)00007-N Computational Research on Interaction and Agency, Part 2.
[44] Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. 2010. Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence.
[45] Student. 1908. The probable error of a mean. Biometrika (1908), 1–25.
[46] A. Walker and M. J. Wooldridge. 1995. Understanding the Emergence of Conventions in Multi-Agent Systems. In ICMAS95. San Francisco, CA, 384–389. http://groups.lis.illinois.edu/amag/langev/paper/walker95understandingThe.html
[47] Jane X. Wang, Edward Hughes, Chrisantha Fernando, Wojciech M. Czarnecki, Edgar A. Duéñez-Guzmán, and Joel Z. Leibo. 2019. Evolving Intrinsic Motivations for Altruistic Behavior. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 683–692.
[48] Rudolf Paul Wiegand and Kenneth A. Jong. 2004. An Analysis of Cooperative Coevolutionary Algorithms. Ph.D. Dissertation. USA. AAI3108645.
[49] H Peyton Young. 1996. The economics of convention. The Journal of Economic Perspectives 10, 2 (1996), 105–122.
A APPENDIX
A.1 Contents
In this appendix we include several details that have been omitted from the main text. In particular:
- In Section B, we prove Theorem 2.1.
- In Section C, we investigate the range of feasible values for the growth rate, $r$.
- In Section D, we provide details on the agent architecture, the introduced signal, the CIC metric, and the fairness indices.
- In Section E, we provide a thorough account of the simulation results.

B PROOF OF THEOREM 2.1
For completeness, we re-state the control problem and Theorem2.1. We want to find a piecewise continuous control 𝐸 𝑡 , so as tomaximize the total revenue for a given episode duration 𝑇 (Eq. 11,where 𝑈 𝑡 ( 𝐸 𝑡 ) is the cumulative revenue at time-step 𝑡 , given byEq. 12). The maximization problem can be solved using OptimalControl Theory [13, 27]. max 𝐸 𝑡 𝑇 Õ 𝑡 = 𝑈 𝑡 ( 𝐸 𝑡 ) subject to 𝑠 𝑡 + = 𝐹 ( 𝑠 𝑡 − 𝐻 ( 𝐸 𝑡 , 𝑠 𝑡 )) (11) 𝑈 𝑡 ( 𝐸 𝑡 ) = 𝑝 𝑡 𝐻 ( 𝐸 𝑡 , 𝑠 𝑡 ) − 𝑐 𝑡 (12)The optimal control is given by the following theorem: Theorem 2.1.
The optimal control variables 𝐸 ∗ 𝑡 that solves themaximization problem of Eq. 11 given the model dynamics describedin Section 2.2 is given by the following equation, where 𝜆 𝑡 are theadjoint variables of the Hamiltonians: ‘Optimal’ is used in a technical sense, as the strategy that maximizes the revenuesubject to the model equations, and it does not carry any moralistic implications. 𝐸 ∗ 𝑡 + = ( 𝐸 𝑚𝑎𝑥 , if ( 𝑝 𝑡 + − 𝜆 𝑡 + ) 𝑞 ( 𝐹 ( 𝑠 𝑡 − 𝐻 ( 𝐸 𝑡 , 𝑠 𝑡 ))) ≥ , if ( 𝑝 𝑡 + − 𝜆 𝑡 + ) 𝑞 ( 𝐹 ( 𝑠 𝑡 − 𝐻 ( 𝐸 𝑡 , 𝑠 𝑡 ))) < Proof.
In order to decouple the state (𝑠_𝑡, the current resource stock) and the control (𝐸_𝑡) and simplify the calculations, we resort to a change of variables. (In accordance with the literature on Optimal Control Theory [27], 'state' in the context of the proof refers to the variable describing the behavior of the underlying dynamical system, and 'control' refers to the input function used to steer the state of the system.) We define the new state 𝑤_𝑡 as the remaining stock after harvest at time-step 𝑡:

w_t \triangleq s_t - H(E_t, s_t)

Therefore,

s_{t+1} = w_{t+1} + H(E_{t+1}, s_{t+1})   (13)

and

s_{t+1} = F(s_t - H(E_t, s_t)) = F(w_t)   (14)

Using Eq. 13 and 14, we can write the new state equation as:

w_{t+1} = F(w_t) - H(E_{t+1}, F(w_t))   (15)

In the current form of the state equation (Eq. 15), the harvested resources appear outside the nonlinear growth function 𝐹(·), making the following analysis significantly simpler.

Under optimal control, the resource will not get depleted before the end of the horizon 𝑇, thus 𝑞(𝑠_𝑡)𝐸_𝑡 ≤ 𝑠_𝑡, ∀𝑡 < 𝑇. (To see this, assume the contrary: the optimal strategy depletes the stock at a certain time-step 𝑇_dep. That means that rewards are zero and the optimal 𝐸_𝑡 is arbitrary for 𝑡 > 𝑇_dep. Using the modified equation for the total harvest (Eq. 16), we allow 𝐻(𝐸_𝑡, 𝑠_𝑡) > 𝑠_𝑡, or 𝑠_𝑡 − 𝐻(𝐸_𝑡, 𝑠_𝑡) < 0. This would lead to 𝑠_{𝑡+1} = 𝐹(𝑠_𝑡 − 𝐻(𝐸_𝑡, 𝑠_𝑡)) < 0, ∀𝑡 > 𝑇_dep, i.e., the stock would become negative. In such a case, any positive effort would decrease the revenue (since it would result in a negative harvest), thus the optimal strategy would be to set 𝐸_𝑡 = 0. Thus, using the modified equation for the total harvest (Eq. 16) does not change the optimal solution.) We can rewrite the total harvest as:

H(E_{t+1}, F(w_t)) = q(F(w_t)) \, E_{t+1}   (16)

The state equation (Eq. 15) becomes:

w_{t+1} = F(w_t) - q(F(w_t)) \, E_{t+1}

and the optimization problem:

\max \sum_{t=-1}^{T-1} \left( p_{t+1} \, q(F(w_t)) \, E_{t+1} - c_{t+1} \right) \quad \text{subject to} \quad w_{t+1} = F(w_t) - q(F(w_t)) \, E_{t+1}

Let 𝑤_{−1} = 𝑠_0 = 𝑆_eq. Solving the optimization problem is equivalent to finding the control that optimizes the Hamiltonians [13, 27]. Let 𝜆 = (𝜆_{−1}, 𝜆_0, ..., 𝜆_{𝑇−1}) denote the adjoint function. The Hamiltonian at time-step 𝑡 is given by:

\mathcal{H}_t = p_{t+1} \, q(F(w_t)) \, E_{t+1} - c_{t+1} + \lambda_{t+1} \left( F(w_t) - q(F(w_t)) \, E_{t+1} \right) = (p_{t+1} - \lambda_{t+1}) \, q(F(w_t)) \, E_{t+1} - c_{t+1} + \lambda_{t+1} F(w_t)   (17)

The adjoint equations are given by [13]:

\lambda_t = \frac{\partial \mathcal{H}_t}{\partial w_t}, \qquad \lambda_T = 0, \qquad \frac{\partial \mathcal{H}_t}{\partial u_t} \Big|_{u_t = u^*_t} = 0

where 𝑢_𝑡 is the control input, which corresponds to 𝐸_{𝑡+1} in our formulation. The last condition corresponds to the maximization of the Hamiltonian 𝓗_𝑡 in Eq. 17 for all time-steps 𝑡 [27, Chapter 23]. In our case, Eq. 17 is linear in 𝐸_{𝑡+1} with coefficient (𝑝_{𝑡+1} − 𝜆_{𝑡+1}) 𝑞(𝐹(𝑤_𝑡)). Therefore, the optimal sequence of 𝐸_{𝑡+1} that maximizes Eq. 17 is given by the sign of the coefficient:

E^*_{t+1} = \begin{cases} E_{max}, & \text{if } (p_{t+1} - \lambda_{t+1}) \, q(F(w_t)) \geq 0 \\ 0, & \text{if } (p_{t+1} - \lambda_{t+1}) \, q(F(w_t)) < 0 \end{cases}

which can be evaluated given the adjoint variables 𝜆_𝑡, the current state 𝑤_𝑡, and the price 𝑝_𝑡. □
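To make the bang-bang structure of Theorem 2.1 concrete, the following minimal Python sketch rolls out the switching rule with a forward-backward sweep. The Ricker-type growth function, the harvest form H(E, s) = q0·E·s (i.e., q(s) = q0·s), and all numerical values are illustrative assumptions for the sketch only, not the parameters or implementation used in the paper.

import numpy as np

# Illustrative parameters (NOT the paper's values).
T, r, S_eq, q0, E_max = 50, 1.0, 100.0, 0.2, 1.0
p = np.ones(T + 1)                                   # price per unit harvest
F = lambda s: s * np.exp(r * (1.0 - s / S_eq))       # assumed Ricker-type growth
dF = lambda s: np.exp(r * (1.0 - s / S_eq)) * (1.0 - r * s / S_eq)

lam = np.zeros(T + 1)    # adjoint variables, with lambda_T = 0
E = np.zeros(T + 1)      # effort schedule E_1 ... E_T
for _ in range(50):      # forward-backward sweep iterations
    # Forward pass: w_t is the stock remaining after harvest.
    w = np.zeros(T + 1)
    w[0] = S_eq
    for t in range(T):
        grown = F(w[t])
        # Bang-bang rule of Theorem 2.1: full effort iff the coefficient
        # (p_{t+1} - lambda_{t+1}) * q(F(w_t)) is non-negative.
        E[t + 1] = E_max if (p[t + 1] - lam[t + 1]) * q0 * grown >= 0 else 0.0
        w[t + 1] = grown - q0 * grown * E[t + 1]
    # Backward pass: lambda_t = dH_t/dw_t along the simulated trajectory.
    for t in range(T - 1, -1, -1):
        lam[t] = ((p[t + 1] - lam[t + 1]) * q0 * E[t + 1] + lam[t + 1]) * dF(w[t])

print("effort schedule (sketch):", E[1:].round(2))

The sketch only illustrates how the sign of the Hamiltonian coefficient switches the effort between 0 and E_max; a full solution would iterate the sweep to convergence with the adjoint equations above.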
C GROWTH RATE

Figure 8: The spawner-recruit function 𝐹(·) for various growth rates 𝑟 (panels 8a–8d).

The growth rate, 𝑟, plays a significant role in the stability of the stock dynamics. A high growth rate – and subsequently high population density – can even lead to the extinction of the population due to the collapse in behavior from overcrowding (a phenomenon known as 'behavioral sink' [4, 5]). This is reflected by the spawner-recruit function (Eq. 4) in our model. As depicted in Fig. 8, the higher the growth rate, 𝑟, the more skewed the stock dynamics. Specifically, Fig. 8 plots the spawner-recruit function 𝐹(·) for growth rates 𝑟 = 1, 2, −𝑊_{−1}(−1/(2𝑒)) ≈ 2.678, and a higher value (Fig. 8a, 8b, 8c, and 8d, respectively). The 𝑥-axis denotes the current stock level (𝑠_𝑡), while the 𝑦-axis depicts the stock level on the next time-step, assuming no harvest (i.e., 𝑠_{𝑡+1} = 𝐹(𝑠_𝑡)). The dashed line indicates a stock level equal to 2𝑆_eq. For the largest growth rate (Fig. 8d), high stock values 𝑠_𝑡 result in 𝑠_{𝑡+1} < 𝛿, 𝛿 → 0, i.e., permanent depletion of the resource. For this reason, we want an unskewed growth model; specifically, we want the stock to remain below two times the equilibrium stock point, i.e., 𝑠_{𝑡+1} ≤ 2𝑆_eq (this limit is imposed by the chosen parameters of the model equations). Therefore, we need to bound the growth rate according to the following theorem:

Theorem C.1.
For a continuous resource governed by the dynamics of Section 2.2, the stock value does not exceed the limit of 2𝑆_eq if 𝑟 ∈ [−𝑊_0(−1/(2𝑒)), −𝑊_{−1}(−1/(2𝑒))] ≈ [0.232, 2.678], where 𝑊_𝑘(·) is the Lambert 𝑊 function.

Proof.
Let 𝑥 ≜ 𝑠_𝑡 ≤ 2𝑆_eq for a time-step 𝑡. We want 𝑠_{𝑡+1} ≤ 2𝑆_eq, thus we need to bound the maximum value of the spawner-recruit function, 𝐹(𝑥) (Eq. 4). Taking the derivative:

\frac{\partial}{\partial x} F(x) = e^{r\left(1 - \frac{x}{S_{eq}}\right)} \left(1 - \frac{r x}{S_{eq}}\right)

We have that:

\frac{\partial}{\partial x} F(x) = 0 \;\Rightarrow\; x = \frac{S_{eq}}{r} \quad (r \neq 0,\; S_{eq} \neq 0), \qquad F\left(\frac{S_{eq}}{r}\right) = \frac{S_{eq}}{r} \, e^{(r-1)}

We want to bound the maximum value:

F\left(\frac{S_{eq}}{r}\right) \leq 2 S_{eq} \;\Rightarrow\; \frac{e^{(r-1)}}{r} \leq 2 \;\Rightarrow\; e^{r} - 2er \leq 0 \;\Rightarrow\; -W_0\left(-\frac{1}{2e}\right) \leq r \leq -W_{-1}\left(-\frac{1}{2e}\right)

where 𝑊_𝑘(·) is the Lambert 𝑊 function, −𝑊_0(−1/(2𝑒)) ≈ 0.232, and −𝑊_{−1}(−1/(2𝑒)) ≈ 2.678. □
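As a quick sanity check on this bound, the following Python sketch computes the two Lambert-W endpoints with scipy and evaluates the Ricker-type form F(s) = s·e^{r(1 − s/S_eq)} implied by the derivative above on [0, 2S_eq]; the value of S_eq is an arbitrary illustrative choice.

import numpy as np
from scipy.special import lambertw

S_eq = 100.0
r_lo = -lambertw(-1 / (2 * np.e), 0).real    # ~ 0.232
r_hi = -lambertw(-1 / (2 * np.e), -1).real   # ~ 2.678
print(f"admissible growth rates: [{r_lo:.3f}, {r_hi:.3f}]")

s = np.linspace(0.0, 2 * S_eq, 10001)
for r in (r_lo, 1.0, 2.0, r_hi):
    F = s * np.exp(r * (1.0 - s / S_eq))
    # Within the admissible range the next-step stock never exceeds 2 * S_eq.
    print(f"r = {r:.3f}: max F(s) = {F.max():.1f} (limit {2 * S_eq:.0f})")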
D MODELING DETAILS
D.1 Agent Architecture Details
Recent work has demonstrated that code-level optimizations play an important role in performance, both in terms of achieved reward and underlying algorithmic behavior [14]. To minimize those sources of stochasticity – and given that the focus of this work is on the performance of the introduced technique and not of the training algorithm – we opted to use RLlib as our implementation framework. (RLlib, https://docs.ray.io/en/latest/rllib.html, is an open-source library on top of Ray, https://docs.ray.io/en/latest/index.html, for Multi-Agent Deep Reinforcement Learning [29].) Each agent uses a two-layer (64 neurons each) feed-forward neural network for the policy approximation. The policies are trained using the Proximal Policy Optimization (PPO) algorithm [41]. All the hyper-parameters were left to the default values specified in Ray and RLlib (see https://docs.ray.io/en/latest/rllib-algorithms.html). For completeness, Table 1 presents a list of the most relevant of them.

Table 1: List of hyper-parameters.
Learning Rate (𝛼): 0.0001
Clipping Parameter: 0.3
Value Function Clipping Parameter: 10.0
KL Target: 0.01
Discount Factor (𝛾): 0.99
GAE Parameter Lambda: 1.0
Value Function Loss Coefficient: 1.0
Entropy Coefficient: 0.0
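To illustrate how the Table 1 values map onto an RLlib configuration, here is a minimal sketch assuming the pre-2.0 RLlib "Trainer" API; the observation space, the example population size, and the per-agent policy naming are illustrative placeholders rather than the authors' actual code. The resulting dict would be passed, together with the (not shown) CPR environment, to ray.rllib.agents.ppo.PPOTrainer.

from gym.spaces import Box
import numpy as np

N = 4                                          # example population size (placeholder)
obs_space = Box(-np.inf, np.inf, shape=(8,))   # hypothetical observation layout
act_space = Box(0.0, 1.0, shape=(1,))          # harvesting effort in [0, 1]

config = {
    "lr": 1e-4,                         # Learning Rate (alpha)
    "clip_param": 0.3,                  # Clipping Parameter
    "vf_clip_param": 10.0,              # Value Function Clipping Parameter
    "kl_target": 0.01,                  # KL Target
    "gamma": 0.99,                      # Discount Factor (gamma)
    "lambda": 1.0,                      # GAE Parameter Lambda
    "vf_loss_coeff": 1.0,               # Value Function Loss Coefficient
    "entropy_coeff": 0.0,               # Entropy Coefficient
    "model": {"fcnet_hiddens": [64, 64]},   # two-layer feed-forward policy
    # Fully decentralized learning: one independent policy per agent.
    "multiagent": {
        "policies": {f"agent_{i}": (None, obs_space, act_space, {}) for i in range(N)},
        "policy_mapping_fn": lambda agent_id: agent_id,
    },
}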
D.2 Introduced Signal: Implementation Details

The introduced signal was encoded as a 𝐺-dimensional one-hot vector of fixed size, in which the high bit is shifted periodically. In particular, its value at index 𝑖 at time-step 𝑡 is given by:

signal_i(t) = \begin{cases} 1, & \text{if } \mathrm{mod}(t - t_{init}, G) = i \\ 0, & \text{otherwise} \end{cases}

where 𝑡_init is the random offset determined at the beginning of each episode in order to avoid learning any bias towards certain signal values.
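For concreteness, here is a minimal Python sketch of that periodic one-hot encoding; the cardinality G = 4 and the use of numpy are illustrative choices only.

import numpy as np

def make_signal(t, t_init, G):
    """Return the G-dimensional one-hot signal at time-step t."""
    signal = np.zeros(G)
    signal[(t - t_init) % G] = 1.0   # high bit shifts periodically with period G
    return signal

# Example: G = 4, random per-episode offset, first six time-steps of an episode.
rng = np.random.default_rng(0)
t_init = int(rng.integers(0, 4))
for t in range(6):
    print(t, make_signal(t, t_init, G=4))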
D.3 Causal Influence of Communication (CIC): Implementation Details

The Causal Influence of Communication (CIC) [31] estimates the mutual information between the signal and the agent's action. The mutual information between two random variables X̃ and Ỹ is defined as the reduction of uncertainty (measured in terms of entropy 𝐻_𝑆(·)) in the value of X̃ with the observation of Ỹ:

I(\tilde{X}, \tilde{Y}) = I(\tilde{Y}, \tilde{X}) = H_S(\tilde{X}) - H_S(\tilde{X} \mid \tilde{Y}) = E\left[ \log \frac{P_{\tilde{X},\tilde{Y}}(\tilde{x}, \tilde{y})}{P_{\tilde{X}}(\tilde{x}) \, P_{\tilde{Y}}(\tilde{y})} \right]

The pseudo-code for calculating the CIC for a single agent is presented in Alg. 1. Note that the CIC implementation in [31] considers a multi-dimensional, one-hot, discrete action space with accessible probabilities for every action, while in our case we have a single, continuous action (specifically, the effort 𝜖_{𝑛,𝑡}). To solve this problem, we discretize our action space into 𝑁_bins intervals between the minimum (E_min = 0) and maximum (E_max = 1) effort values, and each interval is assumed to correspond to a single discrete action. Let 𝑎_𝑖 denote the event of an action 𝜖_{𝑛,𝑡} belonging to interval 𝑖. To calculate the CIC value, we start by generating 𝑁_states random 'partial' states (i.e., without signal), 𝜎_{−𝑔}, which are then concatenated with each possible signal value to obtain a 'complete' state, 𝜎 = [𝑔_𝑗, 𝜎_{−𝑔}] (Lines 5 - 7 of Alg. 1). Then, we estimate the probability of an action given a signal value, 𝑝(𝑎_𝑖 | 𝑔_𝑗), by generating 𝑁_samples actions from our policy given the 'complete' state (𝜋(𝜎)), and normalizing the number of instances in which the action belongs to a particular bin by the total number of samples. The remaining aspects of the calculation are the same as in the original implementation. In our calculations we set 𝑁_states = 𝑁_samples.

Algorithm 1: CIC Implementation (based on [31])
input: Agent policy 𝜋(·)
    𝑝(𝑔_𝑗) = 1/𝐺 for all possible signals
    Discretize the action space [0, E_max] into 𝑁_bins intervals.
    for i = 1 to 𝑁_states do
        Generate a state without a signal, 𝜎_{−𝑔}, randomly
        for all possible signals 𝑔_𝑗 do
            Generate agent observation 𝜎 = [𝑔_𝑗, 𝜎_{−𝑔}]
            Estimate 𝑝(𝑎_𝑖 | 𝑔_𝑗) by sampling 𝑁_samples actions from 𝜋(𝜎)
            𝑝(𝑎_𝑖, 𝑔_𝑗) = 𝑝(𝑎_𝑖 | 𝑔_𝑗) 𝑝(𝑔_𝑗)
        end
        𝑝(𝑎_𝑖) = Σ_𝑗 𝑝(𝑎_𝑖, 𝑔_𝑗)
        CIC += (1/𝑁_states) Σ_{𝑎_𝑖 | 𝑝(𝑎_𝑖) ≠ 0, 𝑔_𝑗} 𝑝(𝑎_𝑖, 𝑔_𝑗) log( 𝑝(𝑎_𝑖, 𝑔_𝑗) / (𝑝(𝑎_𝑖) 𝑝(𝑔_𝑗)) )
    end
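The following Python sketch mirrors Alg. 1 for a single agent with a continuous effort action; the dummy signal-dependent policy, the random 'partial' states, and all sizes (G, N_bins, N_states, N_samples) are illustrative stand-ins for the trained PPO policy and the real observation layout.

import numpy as np

def cic(policy, G, obs_dim, n_states=100, n_samples=100, n_bins=10, e_max=1.0):
    """Estimate the causal influence of the signal on a single agent's action."""
    rng = np.random.default_rng(0)
    p_g = 1.0 / G                              # uniform prior over signal values
    bins = np.linspace(0.0, e_max, n_bins + 1)
    total = 0.0
    for _ in range(n_states):
        partial = rng.normal(size=obs_dim)     # random 'partial' state (no signal)
        p_ag = np.zeros((n_bins, G))           # joint distribution p(a_i, g_j)
        for j in range(G):
            signal = np.zeros(G)
            signal[j] = 1.0
            obs = np.concatenate([signal, partial])
            actions = np.array([policy(obs) for _ in range(n_samples)])
            counts, _ = np.histogram(actions, bins=bins)
            p_ag[:, j] = (counts / n_samples) * p_g    # p(a_i | g_j) * p(g_j)
        p_a = p_ag.sum(axis=1)                 # marginal p(a_i)
        indep = np.outer(p_a, np.full(G, p_g)) # p(a_i) * p(g_j)
        mask = p_ag > 0                        # skip zero-probability cells
        total += np.sum(p_ag[mask] * np.log(p_ag[mask] / indep[mask]))
    return total / n_states

# Dummy policy whose effort depends on the position of the one-hot signal.
G = 4
dummy_policy = lambda obs: float(np.clip(np.argmax(obs[:G]) / G + 0.05 * np.random.randn(), 0.0, 1.0))
print("CIC estimate:", round(cic(dummy_policy, G=G, obs_dim=3), 3))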
D.4 Fairness Metrics

We also evaluated the fairness of the final allocation, to ensure that agents are not being exploited by the introduction of the signal. We used two of the most established fairness metrics: the Jain index [22] and the Gini coefficient [16].

(a) The Jain index [22]: Widely used in network engineering to determine whether users or applications receive a fair share of system resources. It exhibits a lot of desirable properties, such as population size independence, continuity, scale and metric independence, and boundedness. For an allocation of 𝑁 agents, such that the 𝑛-th agent is allotted 𝑥_𝑛, the Jain index is given by Eq. 19. J(x) ∈ [1/𝑁, 1]. An allocation x = (𝑥_1, ..., 𝑥_𝑁)^⊤ is considered fair iff J(x) = 1.

\mathcal{J}(\mathbf{x}) = \frac{\left( \sum_{n=1}^{N} x_n \right)^2}{N \sum_{n=1}^{N} x_n^2}   (19)

(b) The Gini coefficient [16]: One of the most commonly used measures of inequality by economists, intended to represent the wealth distribution of the population of a nation. For an allocation game of 𝑁 agents, such that the 𝑛-th agent is allotted 𝑥_𝑛, the Gini coefficient is given by Eq. 20. G(x) ≥ 0. A Gini coefficient of zero expresses perfect equality, i.e., an allocation is fair iff G(x) = 0.

\mathcal{G}(\mathbf{x}) = \frac{\sum_{n=1}^{N} \sum_{n'=1}^{N} |x_n - x_{n'}|}{2 N \sum_{n=1}^{N} x_n}   (20)

Both metrics showed that learning both with and without the signal results in fair allocations, with no significant change with the introduction of the signal (see Section E).
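As a small illustration of Eq. 19 and 20, the following Python sketch evaluates both metrics on two hypothetical allocations (a perfectly equal one and a fully concentrated one); the vectors are made-up examples.

import numpy as np

def jain_index(x):
    """Jain index of an allocation vector (Eq. 19); 1 means perfectly fair."""
    x = np.asarray(x, dtype=float)
    return x.sum() ** 2 / (len(x) * np.sum(x ** 2))

def gini_coefficient(x):
    """Gini coefficient of an allocation vector (Eq. 20); 0 means perfect equality."""
    x = np.asarray(x, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * len(x) * x.sum())

equal = np.ones(8)                     # perfectly fair allocation
skewed = np.array([8.0] + [0.0] * 7)   # one agent appropriates everything
print(jain_index(equal), gini_coefficient(equal))    # 1.0, 0.0
print(jain_index(skewed), gini_coefficient(skewed))  # 0.125, 0.875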
E SIMULATION RESULTS IN DETAIL

In this section we provide numerical values of the simulation results presented in the main text. Specifically, Tables (2 and 3), (4 and 5), (6 and 7), (8 and 9), and (10 and 11) include the results (absolute values with and without the introduced signal, relative difference, and Student's T-test p-values) on social welfare, episode lengths (in time-steps), training time (in number of episodes), Jain index, and Gini coefficient, respectively. Table 12 presents the CIC values. Tables 13, 14, 15, 16, and 17 present the results on the aforementioned metrics for varying signal size, 𝐺.
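For reference, a relative difference and a Student's T-test p-value of the kind reported in these tables could be computed from the per-trial results as in the following sketch; the two 8-trial sample vectors are made-up numbers, not values from the paper.

import numpy as np
from scipy import stats

# Hypothetical per-trial social welfare over 8 trials, without and with the signal.
baseline = np.array([10.2, 11.5, 9.8, 10.9, 10.1, 11.0, 9.5, 10.7])
with_signal = np.array([31.0, 28.4, 33.2, 30.1, 29.8, 32.5, 30.9, 31.7])

relative_diff = (with_signal.mean() - baseline.mean()) / baseline.mean()
t_stat, p_value = stats.ttest_ind(with_signal, baseline)

print(f"relative difference: {100 * relative_diff:.1f}%")
print(f"Student's T-test p-value: {p_value:.4f}")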
We also ran simulations with a higher growth rate, specifically 𝑟 = 2. The results can be found in Tables 18 and 19. Every resource has a natural upper limit on the size of the population it can sustain. Fig. 2 shows that as the number of agents grows, we reach sustainable strategies at higher equilibrium stock multipliers (𝑀_𝑠). Thus, we expect that as we increase the growth rate, the effect of the signal will be even more pronounced in larger populations (𝑁) – which is corroborated by the aforementioned tables. The environment's ability to sustain a population also affects the counts of 'active' agents of Section 3.7.2. As we can see in Fig. 2, for a growth rate of 𝑟 = 1 the resource is too scarce; thus, even with the addition of the signal, the first sustainable strategy is reached at 𝑀_𝑠 = 0.9. This is a high equilibrium stock multiplier, close to the limit of sustainable harvesting. As a result, the number of 'active' agents is naturally really high, because they do not need to harvest in turns. By increasing the growth rate to 𝑟 = 2, we have an environment that can sustain larger populations. Therefore, the first strategy that does not result in an immediate depletion is reached much earlier, and we can observe the emergence of a temporal convention (see Table 20).

Finally, Table 20 shows the average number of agents in each effort bin – 'idle' (lowest effort 𝜖), 'moderate' (intermediate effort), and 'active' (highest effort) – starting from the first equilibrium stock multiplier (𝑀_𝑠) where a non-depleting strategy was achieved in each setting.

All reported results are the average values over 8 trials. Note also that, as specified in Section 3.1.1, the value of 𝐾 depends on the growth rate 𝑟; therefore, it differs between the settings with 𝑟 = 1 and 𝑟 = 2.

Table 2: Social Welfare. Results (averaged over 8 trials) for increasing population size 𝑁, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 3: Social Welfare (relative difference and p-values). (i) Relative difference in social welfare when a signal of cardinality 𝐺 = 𝑁 is introduced ((Result_{𝐺=𝑁} − Result_{no signal}) / Result_{no signal}), and (ii) Student's T-test p-values, for varying population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 4: Episode Length. Results (averaged over 8 trials) for increasing population size 𝑁, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 5: Episode Length (relative difference and p-values), for varying population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠). NaN values in the p-values column are due to having only a single data point; both cases (with and without the signal) have the same episode length in all the trials.

Table 6: Training Time. Results (averaged over 8 trials) for increasing population size 𝑁, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 7: Training Time (relative difference and p-values). (i) Relative difference in training time when a signal of cardinality 𝐺 = 𝑁 is introduced, and (ii) Student's T-test p-values, for varying population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).
Table 8: Jain Index (higher is fairer). Results (averaged over 8 trials) for increasing population size 𝑁, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 9: Jain Index (relative difference and p-values). (i) Relative difference in the Jain index when a signal of cardinality 𝐺 = 𝑁 is introduced, and (ii) Student's T-test p-values, for varying population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 10: Gini Coefficient (lower is fairer). Results (averaged over 8 trials) for increasing population size 𝑁, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 11: Gini Coefficient (relative difference and p-values). (i) Relative difference in the Gini coefficient when a signal of cardinality 𝐺 = 𝑁 is introduced, and (ii) Student's T-test p-values, for varying population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 12: CIC values. Results (averaged over the 8 trials and the agents in the population) for increasing population size 𝑁 and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Tables 13–17: Social Welfare, Episode Length, Training Time, Jain Index (higher is better), and Gini Coefficient, respectively. Results (averaged over 8 trials) for varying signal size 𝐺 and three equilibrium stock multipliers 𝑀_𝑠. Each table includes (i) absolute values, (ii) the relative difference (%), i.e., (Result_{𝐺=𝑋} − Result_{no signal}) / Result_{no signal}, where Result_{𝐺=𝑋} denotes the achieved result using a signal of cardinality 𝑋, and (iii) Student's T-test p-values with respect to the no-signal baseline.

Table 18: Social Welfare, Episode Length, Training Time, Jain Index, and Gini Coefficient. Results (averaged over 8 trials) for the higher growth rate (𝑟 = 2), with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠) and a fixed population size 𝑁.

Table 19: Social Welfare, Episode Length, Training Time, Jain Index, and Gini Coefficient. (i) Relative difference in the achieved result when a signal of cardinality 𝐺 = 𝑁 is introduced, and (ii) Student's T-test p-values, for the higher growth rate (𝑟 = 2) and environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠).

Table 20: Average number of agents in each effort bin ('idle', 'moderate', and 'active'). The presented values start from the first equilibrium stock multiplier (𝑀_𝑠) where a non-depleting strategy was achieved in each setting. Results (averaged over 8 trials) for increasing population size, with (𝐺 = 𝑁) and without the introduced signal, for environments of decreasing difficulty (increasing 𝑆_eq ∝ 𝑀_𝑠), and for growth rates 𝑟 = 1 and 𝑟 = 2.