A Q-values Sharing Framework for Multiagent Reinforcement Learning under Budget Constraint
Changxi Zhu
South China University of Technology [email protected]
Ho-fung Leung
The Chinese University of Hong Kong [email protected]
Shuyue Hu
National University of Singapore [email protected]
Yi Cai
South China University of Technology [email protected]
December 1, 2020

Abstract
In the teacher-student framework, a more experienced agent (teacher) helps accelerate the learning of another agent (student) by suggesting actions to take in certain states. In cooperative multiagent reinforcement learning (MARL), where agents need to cooperate with one another, a student may fail to cooperate well with others even by following the teachers' suggested actions, as the policies of all agents keep changing before convergence. When the number of times that agents communicate with one another is limited (i.e., there is a budget constraint), an advising strategy that uses actions as advice may not be good enough. We propose a partaker-sharer advising framework (PSAF) for cooperative MARL agents learning under a budget constraint. In PSAF, each Q-learner can decide when to ask for Q-values and when to share its Q-values. We perform experiments in three typical multiagent learning problems. Evaluation results show that our approach PSAF outperforms existing advising methods under both unlimited and limited budgets, and we give an analysis of the impact of advising actions and sharing Q-values on agents' learning.
Keywords: multiagent reinforcement learning · cooperative learning · Q-learner · knowledge sharing

1 Introduction

Many real-world tasks, e.g., robotic games [1] and distributed power allocation [2], involve a set of agents in a common environment. By learning from trial and error, an agent adapts its behavior in an unfamiliar environment; this is the popular mechanism of Reinforcement Learning (RL). However, RL methods require a long period of interaction with the environment, resulting in limited scalability and applicability in complex situations. When RL is used in a Multiagent System
(MAS), a setting also referred to as Multiagent Reinforcement Learning (MARL), these problems are further intensified, since all agents are changing and adapting their behaviors [3], thereby increasing the amount of interaction needed. To know whether an action is optimal in MARL, an agent must execute the action, often many times. When the computing resources of agents are limited, the time they spend should be used more efficiently over their entire lifetime.

One natural solution to speed up the learning process of RL agents is knowledge sharing, which relies on communicating various pieces of external knowledge, such as policies, value functions and episodes (state, action, quality triplets), among agents. When an agent knows little about the current environment, it can undoubtedly reduce its exploration time by reusing knowledge from other agents who have acquired some skills in advance. Three problems arise when agents communicate with one another: (1) deciding what kind of knowledge to share; (2) deciding when to share knowledge, especially when communication is limited due to cost; and (3) deciding how to integrate received knowledge. Recently, one notable approach is the teacher-student framework, which focuses on
Action Advising [4, 5]. In this framework, a more experienced agent or human expert (teacher) aids the learning of another agent (student) by suggesting actions to take in certain states. Advising opportunities are established jointly by the teacher and the student, which is suitable for the case of limited communication. In order to model the communication cost, the numbers of times that a student asks for action advice and that a teacher provides advice are each constrained by a budget, i.e., a budget constraint. The framework has achieved promising results on both single-agent and multi-agent problems. In this paper, we consider multiple RL agents that cooperatively try to solve a task: they learn to cooperate with each other by coherently choosing their individual actions based on their own observations, such that the resulting joint action is optimal. Each agent learns to act optimally by maximising its Q-function, which gives the expected cumulative discounted reward for every state. In this setting, however, a student may fail to cooperate well with others even by following a teacher's suggested actions, since all other agents are learning and adjusting their own policies as training progresses. More importantly, it is still hard for the student to learn, specifically, the Q-values corresponding to the advised actions. The problem of applying the teacher-student framework in cooperative MARL is that advising actions to a student does not inherently change the student's policy; the student needs a long period to adapt its behaviour to other agents in the changing environment.

Sharing knowledge is of utmost importance for cooperative agents learning in a shared environment. In such a case, if they are equipped with the same learning structure, they most likely become interchangeable, i.e., they have identical optimal policies and value functions. Without any knowledge mapping, an agent can directly use knowledge, e.g., Q-values, from more experienced agents as its own. If a student is able to choose its next action based on its teachers' Q-values in the current state, then it can act without having to take more time to learn for that state. Using the Q-values from a teacher requires that the agents involved in the advising relation are reinforcement learners and have similar (or even the same) reward functions. Although this may sacrifice flexibility compared with action advising, we argue that sharing Q-values is more effective than advising actions for a cooperative team of RL agents. The Q-values from other agents guide an agent's exploitation of the team's currently learned Q-values, rather than leaving it to learn the optimal Q-function entirely by itself. Another critical consideration in our work is that sharing Q-values among all agents at every time step takes time and can be costly, especially for multiagent systems composed of a large population of agents, such as in the field of social learning [6]. Similar to the advising budget in the
teacher-student framework, we assume that each agent is constrained by two budgets, one for asking for and one for giving Q-values. It is therefore vital to decide the proper time to share Q-values.

In this paper, we present a partaker-sharer advising framework (PSAF) for cooperative MARL agents to share Q-values under a budget constraint, where only a limited number of Q-values can be shared over the whole learning process. In PSAF, a more experienced agent (sharer) can provide another agent (partaker) with its maximum Q-value for the state of the partaker, which indicates the sharer's best action. Each agent can play the role of partaker or sharer in different sharing processes. An agent decides when to start asking for Q-values according to how many times it has explored its current state. That is, if it has visited the current state very few times, it can initiate a sharing process, take the role of partaker, and send a request to all other agents. In order to avoid sharing Q-values that are even worse than the partaker's, another agent evaluates how confident it is in the partaker's state. If that agent has updated its maximum Q-value many more times than the partaker, it can join the partaker's sharing process, take the role of sharer, and share its maximum Q-value. Each agent has two numeric budgets, for requesting and providing Q-values respectively.

Our contributions in this paper are threefold. First, we propose a novel Q-values sharing framework, PSAF, for cooperative MARL agents learning under a budget constraint. Since agents can only share a limited number of Q-values, each agent in PSAF should decide when to ask for Q-values and when to share its Q-values, and how to integrate the shared Q-values into its own Q-function. Second, we highlight that the amount of budget is essential for agents advising actions as well as sharing Q-values, and introduce a new metric for the sharer to decide sharing opportunities, in order to use the budget more efficiently. We also explore the effect of different amounts of budget on the performance of all compared methods. Third, we show that our proposed framework PSAF significantly accelerates the learning of agents in three typical multiagent learning problems: the Predator-Prey domain, Half Field Offense, and the Spread game. We examine when agents share Q-values and actions. The distribution of sharing opportunities shows that in PSAF, most Q-values are shared when a partaker has updated the Q-values for the corresponding state-action pairs very few times while a sharer has updated them many times. Besides, we compare and contrast the effects of sharing Q-values and advising actions on agents' learning, which has not been investigated in previous works.

The remainder of this article is organized as follows. Section 2 describes related work on sharing Q-values and the teacher-student framework. In Section 3, we present background on multiagent reinforcement learning, temporal difference learning algorithms with eligibility traces, and a jointly-initiated advising framework. Section 4 formally presents our method PSAF, including when to ask for and give Q-values, and how to use the received Q-values. In Section 5, we conduct empirical demonstrations in the Predator-Prey domain, Half Field Offense and the Spread game, showing that our approach outperforms existing advising methods under unlimited and limited budgets. In Section 6, we give an analysis of advising actions and sharing Q-values, and discuss which situations are suitable for PSAF.
Finally, in Section 7, we draw the conclusions of this article and propose future directions.
2 Related Work

The idea of sharing Q-values has its roots in multiagent reinforcement learning. The Q-values of reinforcement learners are the expected cumulative discounted rewards gained by performing actions in states during learning. Sharing Q-values reduces the time needed to learn which actions are rewarded. The
centralized learning method [7] learns a shared Q-function over the actions and observations of all agents in an environment. Both the state and action spaces scale exponentially with the number of agents, rendering this approach infeasible for thousands of agents learning together. By contrast, we consider agents that are decentralized in both training and execution, which makes it convenient for agents to join and leave [8, 9, 10]. Concretely, each agent is equipped with a separate processor and memory to update and store its own policy. One simple method without any communication is to use independent learners [11, 8], where each agent individually learns its policy or Q-function based on its own observations. However, it is nontrivial to determine how decentralized agents can benefit from sharing Q-values with one another in order to accelerate the learning process.

The first line of work on sharing Q-values assumes that the Q-values being shared have been prepared before the current learning process. One typical approach is transfer learning, especially transferring Q-functions among different tasks [12, 13], whereby an agent can learn faster in a target task after training on a less complex task. Even though this method leads to a dramatic speedup in learning, it requires building hand-coded relationships between action-value functions, for example, mappings between the different state representations and reward functions of different tasks. When several agents learn to cooperate in a multiagent system, they may need to coordinate in a few states while acting independently in the others. Using this idea, Kok and Vlassis [14] propose Sparse Tabular Multiagent Q-learning, which maintains a predefined list of states in which coordination is necessary. Only in these states do agents perform joint actions and update a shared Q-table; they learn their own Q-values in all remaining states. In order to automatically identify coordinating states, De Hauwere et al. propose CQ-learning [15] and FCQ-learning [16]. Both methods assume that an agent has already learned its optimal policy alone before learning together with other agents. When an agent acts in the multiagent environment, a state is marked as a conflict state if the agent detects a change in the received immediate reward compared with the expected reward derived from its optimal single-agent policy. It then incorporates its previously acquired Q-values from the single-agent environment into the current joint Q-function; otherwise, the previous Q-values are used to select the next actions. However, these works assume that agents have already learnt (or are able to learn) some kind of single-agent knowledge (e.g., a local value function) before the multi-agent learning process. By comparison, we consider multiple agents that start learning with no previous knowledge in a shared environment, so they must rely on their partners, who may have explored different regions of the state space. Tan [17] points out that cooperative Q-learning agents can communicate several aspects of their learning process for the current task, such as policies (or complete Q-tables), episodes (state, action, quality triplets), and sensations. The Q-values for every state-action pair of all agents are averaged at a predefined time interval, and the averaged Q-values are then assigned to each agent. Nevertheless, if agents communicate over wireless channels such as Wi-Fi and LTE, sharing whole Q-tables may incur a high communication cost, especially when agents learn in a large-scale multiagent system.
One of the works most similar to ours is Parallel Transfer Learning (PTL) [18, 19], which enables multiple agents to transfer their Q-values while they are simultaneously learning in different tasks or in the same task. In order for knowledge sharing to be beneficial, the Q-values to be shared are carefully selected. When all agents learn together in a common environment, a fixed number of Q-values is transferred at every time step according to the importance of the associated states. In PTL, agents must be constantly connected with each other in order to share their Q-values. In comparison, we seek to have agents communicate with each other only when necessary, and an agent is able to decide in which situations it shares its Q-value when requested. The major merit of our framework is its
flexibility and applicability to a large number of agents learning cooperatively under a budget constraint.

There are many approaches that address how to use Q-values received from other agents, even though they generally ignore the problem of when to share Q-values, which we emphasize in this work. These methods can be classified into: (a) Average-Q; (b) Weighted-Q; (c) Replace-Q. The Average-Q method, also known as Policy Averaging [17], is the simplest way to utilize shared Q-values. It assumes that all agents contribute equally to the corresponding state-action pairs: when an agent communicates with other agents, its own Q-values and the shared Q-values are averaged to produce the new Q-values for the corresponding state-action pairs. Since agents may have experienced different areas of the state-action space at a given time step, some outdated Q-values are taken into account in the Average-Q method, so an agent needs to combine the shared Q-values more carefully. The Weighted-Q method allows a weighted combination of received and local Q-values. Ahmadabadi and Asadpour [20] propose several weighting strategies to generate new Q-values from an agent's current Q-values and the received Q-values: each agent measures the expertness of its teammates, assigns a weight to their knowledge, and learns from them accordingly. In PTL [19], as the number of times that an agent visits a state increases, locally learned Q-values are weighted more heavily than the received Q-values. However, when an agent knows nothing or very little about a particular state, incorporating its Q-values may sharply decrease the new Q-values, which makes the learning process unstable. In the Replace-Q method, an agent fully accepts the shared knowledge, as it has no better information. A more complex case arises when the received Q-values come from multiple sources and the agent must select one of them. Experience counting, inspired by the non-trivial update counting method in [21], is another popular sharing strategy [22]. It maintains an experience table, in addition to the Q-function, to keep track of which states and actions an agent has actually visited during learning. The central idea behind this method is that the most experienced agent should contribute most to the Q-value of the corresponding state-action pair; at every time step, all agents use, for each state-action pair, the Q-value of the agent with the most experience. With a tabular representation of the Q-function, which is perhaps the most common in RL tasks, the new Q-values generated by Average-Q, Weighted-Q and Replace-Q can be directly assigned to the corresponding state-action pairs. Another possible way of absorbing shared Q-values is to update local Q-values with a Q-learning-like rule, as in CQ-learning and FCQ-learning: after an action is performed in the current state, the Q-values from other knowledge sources for the next state are used to bootstrap the Q-values of the current state. This method does not need direct access to an agent's Q-function (e.g., its Q-table), so it can be extended to problems with continuous state or action spaces.
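To make the three combination rules above concrete, the following minimal Python sketch shows one way they could be implemented for a single state's Q-value vector. The visit-count weighting in weighted_q is an illustrative stand-in for the expertness measures of [20] and PTL, not their exact formulas, and all function and variable names are our own.

```python
import numpy as np

def average_q(local_q, shared_qs):
    """Average-Q: treat all agents' estimates for this state as equally reliable."""
    return np.mean([local_q] + list(shared_qs), axis=0)

def weighted_q(local_q, shared_q, n_local_visits, n_shared_visits):
    """Weighted-Q (illustrative): weight each estimate by how often its owner
    has updated this state, so more experienced agents contribute more."""
    w_local = n_local_visits / (n_local_visits + n_shared_visits + 1e-8)
    return w_local * local_q + (1.0 - w_local) * shared_q

def replace_q(local_q, shared_qs, experience_counts):
    """Replace-Q with experience counting: adopt the estimate of the most
    experienced agent (index 0 is the local agent)."""
    candidates = [local_q] + list(shared_qs)
    best = int(np.argmax(experience_counts))
    return candidates[best]

# Example: Q-value vectors over 5 actions for one state.
local = np.array([0.1, 0.0, 0.3, 0.0, 0.0])
shared = np.array([0.5, 0.2, 0.1, 0.0, 0.4])
print(average_q(local, [shared]))
print(weighted_q(local, shared, n_local_visits=3, n_shared_visits=40))
print(replace_q(local, [shared], experience_counts=[3, 40]))
```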
Recently, the teacher-student framework [4] has received a lot of attention [5, 23, 24, 13]. In this framework, a more experienced agent (teacher) accelerates the learning process of another agent (student) by providing advice on which action to take. The student updates its policy based on rewards received from the environment, while its exploration is guided by the teacher's advice. Some works, such as the supervision framework designed for organizational structures [25] or supervising reinforcement learning [26], propose to supervise a network of agents by providing rules (forbidden actions in certain states), suggestions (preferred actions in certain states), and so on. However, all supervisors are fixed, and the agents supervised by a supervisor need to communicate with it at every time step. By contrast, the most significant aspect of the teacher-
student framework is that advising opportunities are determined only when required, considering practical concerns regarding attention and communication. This setting is also widely adopted by many extensions of the teacher-student framework, as described in [27, 28, 5]. Moreover, both the multiagent advising framework [5] and our work assume no fixed role for each agent, which means each agent has the opportunity to be a teacher (or sharer). There are three modes for deciding advising opportunities: student-initiated [29], teacher-initiated [4] and jointly-initiated [30]. In the student-initiated mode, a student who learns to act optimally in a task is assisted by an expert teacher whenever the student's confidence in a state is low; sharing decisions made by the student are likely to be weak, since the student itself is still learning. Conversely, in the teacher-initiated mode, a trained teacher decides when to give advice to a student. This mode requires the student's current state to always be communicated to the teacher, and the teacher must constantly pay attention to the student's learning; transmitting every state that the student experiences to the teacher can incur a prohibitive communication cost. The jointly-initiated approach determines advising opportunities under the agreement of both teacher and student. In this mode, the teacher is not required to constantly monitor the student; the relation between a student and a teacher is established on demand, which is particularly suitable for the case of limited communication. The work of Silva et al. [5] is the first to apply the teacher-student framework to a multiagent system composed of multiple simultaneously learning agents. In order to overcome the challenge of having no fixed teacher and student roles, they extend the heuristics from [4] to measure an agent's confidence in a given state based on the number of times it has visited the state. The numbers of times that a student asks for advice and that a teacher gives action advice are limited by two budgets respectively. Under this budget constraint, their approach achieves state-of-the-art results, even compared with sharing whole episodes during learning. Despite these promising results, an advising strategy that uses actions as advice may not be good enough to accelerate the overall learning process of Q-learners.
3 Background

We are interested in a cooperative multi-agent setting where agents receive local observations and learn in a decentralised fashion. All communication among agents must be explicitly specified. The learning problem of multiple decentralised agents with local observations is generally modelled as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP) [31], which is an extension of the Markov Decision Process (MDP) [32]. A Dec-POMDP is defined by a tuple $\langle I, S, A, T, R, \Omega, O, \gamma \rangle$, where $I$ is the set of $n$ agents, $S$ is the set of environment states, $A = \times_{i \in I} A_i$ is the set of joint actions, $T$ is the state transition probability function, $R$ is the reward function, $\Omega = \times_{i \in I} \Omega_i$ is the set of joint observations, $O$ is the set of conditional observation probabilities, and $\gamma \in [0, 1]$ is the discount factor. At every time step $t$ in an environment state $s'$, each agent $i$ perceives its own observation $o^i_t$ from the joint observation $o = \langle o^1, \ldots, o^n \rangle$ determined by $O(o \mid s', a)$, where $a = \langle a^1, \ldots, a^n \rangle$ is the joint action that causes the state transition from $s$ to $s'$ according to $T(s' \mid a, s)$, and receives reward $r^i$ determined by $R(s, a)$. We focus on cooperative Multiagent Reinforcement Learning (MARL), where several reinforcement learning agents jointly affect the environment and receive the same reward ($r^1 = r^2 = \ldots = r^n$). The observability of agents in Dec-POMDPs has been elaborately discussed by Oliehoek and Amato [31]. In our framework, we assume that agents
are able to observe each other at any time and infer each other's local observations. That is, the individual observation (partial view) of each agent always uniquely identifies the environment state. Since the agents are distributed, it is still difficult to learn the optimal policy for each of them [31].
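For concreteness, the Dec-POMDP tuple $\langle I, S, A, T, R, \Omega, O, \gamma \rangle$ can be bundled into a simple container as in the sketch below; the field names and callable signatures are our own illustrative choices, not an interface defined in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = int
JointAction = Tuple[int, ...]   # one action per agent
JointObs = Tuple[int, ...]      # one observation per agent

@dataclass
class DecPOMDP:
    """Container for the tuple <I, S, A, T, R, Omega, O, gamma>."""
    agents: List[int]                                             # I
    states: List[State]                                           # S
    joint_actions: List[JointAction]                              # A = x_i A_i
    transition: Callable[[State, JointAction, State], float]      # T(s' | s, a)
    reward: Callable[[State, JointAction], float]                 # R(s, a), shared by all agents
    joint_observations: List[JointObs]                            # Omega = x_i Omega_i
    observation: Callable[[JointObs, State, JointAction], float]  # O(o | s', a)
    gamma: float                                                  # discount factor in [0, 1]
```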
Temporal difference (TD) learning algorithms such as Q-learning [33] and SARSA [32] learn an action-value function $Q(s, a)$, which is an estimate of the expected cumulative discounted reward obtained when an agent takes action $a$ in state $s$. In our work, each agent individually receives its own state from a shared environment and learns a Q-function. (In this paper, the local observation of an agent is also referred to as the agent's own state; the former term takes a multi-agent view while the latter takes a single-agent view.) Learning a policy for an agent $i$ means better estimating its Q-function $Q_i(s_i, a_i)$ for each action $a_i$ in its own state $s_i$. Since all agents have the same reward function (equivalent to a joint reward function), their Q-functions are also the same when no explicit specialization is imposed, e.g., different learning structures. We define the experience of agent $i$ at time step $t$ as a tuple $\langle s^i_t, a^i_t, s^i_{t+1}, r^i_{t+1} \rangle$, where state $s^i_{t+1}$ is reached after the agent executes action $a^i_t$ in its state $s^i_t$, and reward $r^i_{t+1}$ is received. The Q-function of agent $i$ is incrementally updated based on the agent's experience gained during learning, using a weighted average of the old value and the new information. The update of the Q-value for each state-action pair is defined as follows:

$$Q^i_t(s^i_t, a^i_t) \leftarrow Q^i_t(s^i_t, a^i_t) + \alpha \delta \quad (1)$$

where $\alpha \in [0, 1]$ is the learning rate and $\delta$ is the TD error. In Q-learning, we have:

$$\delta = r^i_{t+1} + \gamma \max_a Q^i_t(s^i_{t+1}, a) - Q^i_t(s^i_t, a^i_t) \quad (2)$$

where $\gamma \in [0, 1]$ is the discount factor. RL agents must choose between focusing on high-reward actions and taking actions with the intent of exploring the environment. We consider agents that adopt $\epsilon$-greedy action selection: at every time step, with a large probability $1 - \epsilon$ an agent takes the action with the highest Q-value in the current state, and with a small probability $\epsilon$ it takes a random action. Thus, in normal learning (without asking for advice), a Q-learning agent chooses the action to be executed at every learning step according to $\epsilon$-greedy. SARSA, another popular RL algorithm, defines the TD error as follows:

$$\delta = r^i_{t+1} + \gamma Q^i_t(s^i_{t+1}, a^i_{t+1}) - Q^i_t(s^i_t, a^i_t) \quad (3)$$

where $a^i_{t+1}$ is the next action that agent $i$ will execute in $s^i_{t+1}$ according to a defined exploration strategy such as $\epsilon$-greedy. These algorithms are guaranteed to converge to the optimal Q-function $Q^*_i$, from which the optimal policy $\pi^*_i$ of agent $i$ can be derived:

$$\pi^*_i(s_i) = \arg\max_a Q^*_i(s_i, a) \quad (4)$$
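The following minimal tabular sketch illustrates the updates in Eqs. (1)-(4) together with $\epsilon$-greedy action selection; the dictionary-based Q-table, the action set, and the particular constants are illustrative assumptions, not values taken from this paper.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3, 4]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def epsilon_greedy(state):
    """With probability 1 - epsilon take the greedy action, otherwise explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # Eq. (4) on the current estimate

def q_learning_update(s, a, r, s_next):
    """Eq. (2): off-policy TD error uses the max over next-state actions."""
    td_error = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error          # Eq. (1)

def sarsa_update(s, a, r, s_next, a_next):
    """Eq. (3): on-policy TD error uses the action actually chosen next."""
    td_error = r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error
```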
In order to speed up reinforcement learning, instead of updating one Q-value at every time step, an n-step backup can be used. We adopt eligibility traces [32] to further enhance the performance of TD algorithms. In this strategy, each agent records the state-action pairs that have been recently visited, and the TD error at the current time step is used to update the Q-values of these state-action pairs. Eligibility traces thus assign credit or blame to the eligible states or actions. At time step $t$ in an episode, agent $i$ updates its accumulating traces for the states that have been visited and the actions that have been taken so far by the following rule:

$$e^i_t(s_i, a_i) = \begin{cases} \gamma \lambda e^i_{t-1}(s_i, a_i), & \text{if } s_i \neq s^i_t \\ \gamma \lambda e^i_{t-1}(s_i, a_i) + 1, & \text{if } s_i = s^i_t \end{cases} \quad (5)$$

where $\lambda \in [0, 1]$ is the trace decay rate. Q($\lambda$) and SARSA($\lambda$) are the extensions of Q-learning and SARSA, respectively. In Q($\lambda$) or SARSA($\lambda$), an agent uses TD errors weighted by the eligibility traces to update the Q-values of every state-action pair experienced in the current episode. Eq. 1 is then modified as follows:

$$Q_i(s_i, a_i) \leftarrow Q_i(s_i, a_i) + \alpha \delta e^i_t(s_i, a_i) \quad (6)$$

Recent works on the teacher-student framework mainly adopt the jointly-initiated method to build advising relations, which is more appropriate for learning under a budget constraint. Here we introduce the multi-agent advising framework AdhocTD [5] to illustrate how advising relations (for action advice) are constructed only when necessary. This framework has shown promising results in the complex stochastic environment Half Field Offense [34], and is considered the state-of-the-art approach. In AdhocTD, at each time step, agent $a_i$ asks for advice with an asking probability $P_{ask}$ in its current state $s$. The asking probability is calculated as follows:

$$P_{ask}(s) = (1 + v_a)^{-\sqrt{n_{visit}(s)}} \quad (7)$$

where $v_a$ is a predetermined parameter and $n_{visit}(s)$ is the number of times that the agent has visited state $s$. Agent $a_j$ gives its best action with a giving probability $P_{give}$ for the state of agent $a_i$. The giving probability is calculated as follows:

$$P_{give}(s) = 1 - (1 + v_b)^{-\sqrt{n_{visit}(s)} \times I(s)} \quad (8)$$

where $v_b$ is a predetermined parameter, and $I(s)$ encodes the difference between agent $a_j$'s maximum and minimum Q-values in the requested state $s$:

$$I(s) = \max_a Q(s, a) - \min_a Q(s, a) \quad (9)$$

All agents are limited by a budget $b_{ask}$ to ask for advice and a budget $b_{give}$ to give advice.
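As a concrete reading of Eqs. (7)-(9), the sketch below computes the AdhocTD asking and giving probabilities from tabular visit counts and Q-values. Variable and function names are illustrative, and the placement of $I(s)$ in the exponent follows the formula as written above.

```python
import math

def p_ask(n_visits, v_a):
    """Eq. (7): probability of asking for advice in a state visited n_visits
    times; it decays quickly as experience grows."""
    return (1.0 + v_a) ** (-math.sqrt(n_visits))

def importance(q_values):
    """Eq. (9): spread between the best and worst Q-value in the requested state."""
    return max(q_values) - min(q_values)

def p_give(n_visits, q_values, v_b):
    """Eq. (8): probability that a potential teacher answers, increasing with
    both its own experience in the state and the importance of the state."""
    return 1.0 - (1.0 + v_b) ** (-math.sqrt(n_visits) * importance(q_values))

# Example: an inexperienced student usually asks ...
print(p_ask(n_visits=1, v_a=0.2))
# ... and an experienced teacher with well-separated Q-values usually answers.
print(p_give(n_visits=200, q_values=[0.9, 0.1, 0.0], v_b=1.0))
```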
4 Partaker-Sharer Advising Framework

Our approach aims to accelerate learning for a cooperative team of reinforcement learning agents in a multiagent environment. These agents are assumed to be homogeneous, meaning that they are interchangeable, i.e., they have identical optimal policies and value functions. During learning, each agent may have unique experience, or local knowledge (i.e., its Q-function), of how to perform effectively in the current task. The agents may not be willing to share their whole Q-functions with each other due to the budget constraint. However, we assume that the team of agents is willing to share a limited number of Q-values over its entire lifetime, which we model as a limited number of times that an agent can ask for and share Q-values. Specifically, each agent can decide to share its maximum Q-value for the current state of the agent who asks for Q-values. The maximum Q-values from other agents guide an agent's exploitation of the team's currently learned Q-values. The goal of our method is to determine when to ask for Q-values and share the maximum Q-values, as well as how to use the received maximum Q-values.

We propose a partaker-sharer advising framework (PSAF) for multiple decentralized Q-learners to share Q-values under a budget constraint. A sharing process is initiated by a partaker, the role taken by an agent who asks for Q-values in its own state. The other agents can take the role of sharer and share their maximum Q-values for the state of the partaker. The partaker then chooses its current action based on its own Q-values as well as the sharers' Q-values. Each agent can take different roles in different sharing processes, depending on whether it decides to ask for or share Q-values for particular states. In this way, an agent may ask for Q-values in its current situation while sharing its maximum Q-value for the requested state of another agent. The maximum numbers of times that an agent can take the roles of partaker and sharer are given by two numeric budgets $b_{ask}$ and $b_{give}$, respectively.

In PSAF, we consider that an agent is more likely to ask for Q-values if it knows very little about some states, since it is more beneficial for the agent to be guided by more experienced agents in those states. Several heuristics from [30, 5] can be used to decide when an agent takes the role of partaker; they rely on the range of Q-values for the requested state or on how many times the agent has visited the state. The range of Q-values can vary greatly at the beginning of training, so it may mislead an agent that tries to ask for Q-values precisely when we expect the agent to need more help. AdhocTD [5] defines an asking function $P_{ask}(s) = (1 + v_a)^{-\sqrt{n_{visit}(s)}}$ based on the number of times an agent has visited state $s$. The function outputs a high probability of requesting advice when the agent has visited state $s$ very few times, and its value declines rapidly as the agent gains more experience in that state. In this paper, we adopt the same function $P^i_{ask}(s_i)$ to allow an agent $i$ to determine when to initiate a sharing process and take the role of partaker in its current state $s_i$. After a sharing process has been successfully initiated, another agent $j$ (any agent except partaker $i$) needs to decide whether it will share its maximum Q-value with the partaker. Previous works on the teacher-student framework have proposed several methods (e.g., the function $P_{give}$ defined by AdhocTD) to help a teacher decide when to advise a student without accessing the student's current learning. However, if both student and teacher learn from scratch, i.e., their policies are generally non-optimal during learning, the teacher may be less helpful or may even give worse advice if it cannot evaluate how well the student has learned. In PSAF, during a sharing process, partaker $i$ tells the other agents how confident it is in its Q-values for its current state $s_i$. Sharing Q-values with low confidence may interfere with the learning of agents who take the role of partaker, and wastes communication. Therefore, the other agents join the sharing process (of partaker $i$) and share their Q-values only if they have higher confidence in those Q-values than the partaker. Intuitively, as training progresses, the more often an agent has updated its Q-value for a state-action pair, the more reliable that Q-value generally becomes.
We first propose a confidence function $\Phi_i : S \times A \rightarrow \mathbb{R}$, which encodes how confident partaker $i$ is in its currently learned Q-values for state $s_i$. Formally, we have:

$$\Phi_i(s_i, a) = m^i_{visit}(s_i, a) \quad (10)$$

where $m^i_{visit}(s_i, a)$ is the number of times that partaker $i$ has updated its Q-value for the state-action pair $(s_i, a)$. In order to share Q-values that are substantially better than the partaker's, and thereby use the budget more efficiently, another agent $j$ intends to share its own Q-values only if it has updated them many more times than the partaker. We therefore propose another confidence
function $\Psi_j : S \times A \rightarrow \mathbb{R}$, which represents the confidence of agent $j$ for the state-action pair $(s_i, a)$. $\Psi_j$ is expected to scale down the number of updates of the Q-value for $(s_i, a)$ to a proper value. We define $\Psi_j$ as follows:

$$\Psi_j(s_i, a) = m^j_{visit}(s_i, a) \times \xi_j \quad (11)$$

where $m^j_{visit}(s_i, a)$ is the number of times that agent $j$ has updated its Q-value for the state-action pair $(s_i, a)$, and $\xi_j \in [0, 1]$. $\xi_j$ determines how many times agent $j$ should have updated the Q-value for $(s_i, a)$, relative to partaker $i$, before agent $j$ can take the role of sharer. When $\xi_j$ is 0, agent $j$ cannot share its Q-values; when $\xi_j$ is 1, the agent shares its Q-values as long as it has updated them more times than partaker $i$. We now consider how to construct $\xi_j$. In a given state, if all Q-values are nearly the same, it does not matter which one is shared. If the maximum Q-value of agent $j$ in a state is much higher than its other Q-values, sharing the maximum Q-value is more meaningful for partaker $i$, since that Q-value may lead to a fine-grained action; agent $j$ should then be more likely to act as a sharer, even if it has not updated the maximum Q-value many times. For convenience, we define $\xi_j$ as the difference between the maximum and minimum Q-values of agent $j$ in state $s_i$, normalized to $[0, 1]$:

$$\xi_j(s_i) = \frac{\max_a Q_j(s_i, a) - \min_a Q_j(s_i, a)}{\max_a Q_j(s_i, a) - \min_a Q_j(s_i, a) + 1} \quad (12)$$

where $\max_a Q_j(s_i, a)$ and $\min_a Q_j(s_i, a)$ are the maximum and minimum Q-values of agent $j$ in state $s_i$, respectively. If the difference between $\max_a Q_j(s_i, a)$ and $\min_a Q_j(s_i, a)$ is close to 0, then $\xi_j$ is low and agent $j$ must have updated its maximum Q-value in state $s_i$ quite a number of times to ensure that the Q-value is reliable. If the difference is large, $\xi_j$ is high, and agent $j$ has more opportunities to share its maximum Q-value.
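A minimal sketch of the confidence functions in Eqs. (10)-(12), assuming tabular update counts and a list of Q-values per state; the helper names are our own.

```python
def phi(update_counts, state, action):
    """Eq. (10): the partaker's confidence is simply the number of times it
    has updated Q(state, action)."""
    return update_counts[(state, action)]

def xi(q_row):
    """Eq. (12): spread of the candidate sharer's Q-values in the requested
    state, normalised to [0, 1)."""
    spread = max(q_row) - min(q_row)
    return spread / (spread + 1.0)

def psi(update_counts, q_row, state, action):
    """Eq. (11): the sharer's confidence is its update count scaled by xi, so
    a large Q-value spread lowers the experience needed before sharing."""
    return update_counts[(state, action)] * xi(q_row)

# The sharer only answers when psi(...) > phi(...) for its best action (Algorithm 2).
```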
Algorithm 1 describes when agent $i$ asks for Q-values and how it uses the shared maximum Q-values. At every time step, the agent in its current state $s_i$ can take the role of partaker as long as budget $b^i_{ask}$ has not been used up (lines 1-3). With the probability given by the asking function $P^i_{ask}(s_i)$, partaker $i$ initiates a sharing process and broadcasts a request for Q-values to all other agents (lines 4-6). The broadcast message contains the partaker's state $s_i$ and its confidence in each Q-value for the current situation. Partaker $i$ then waits until a predefined timeout to collect answers. We denote by $\Pi$ the collection of Q-values received from all sharers for the state of the partaker (lines 7-8). If $\Pi$ is not empty, $b^i_{ask}$ is decremented by 1 (lines 9-10). Partaker $i$ may receive several Q-values for the same state-action pair in a sharing process, since some sharers may have the same best action according to their own Q-values. Moreover, a sharer cannot access other sharers' Q-functions before joining the partaker's sharing process, in order to avoid additional communication. In PSAF, the partaker can either randomly select one of the shared Q-values or choose the Q-value with the maximum confidence among them, which is specified by the function $\Gamma$ (line 11). Partaker $i$ then needs to integrate the Q-value selected by $\Gamma$ into its local Q-function for the current state $s_i$. In this paper, we assume a tabular representation of the Q-function. For each corresponding state-action pair, partaker $i$ replaces its original Q-value with the shared Q-value from $\Gamma$ (lines 12-14). Then, in order to exploit the Q-values shared by the team, the partaker executes its currently best action, i.e., the action with the currently maximum Q-value in state $s_i$ (line 15). If agent $i$ does not ask for Q-values or no Q-value is received, the agent follows its usual exploration strategy, i.e., $\epsilon$-greedy (lines 16-17).
Algorithm 1 Ask for Q-values and use the shared maximum Q-values in a given state
Require: agent $i$, budget $b^i_{ask}$, asking function $P^i_{ask}$, and function $\Phi_i$.
 1: for each time step do
 2:   let the current state be $s_i$
 3:   if $b^i_{ask} > 0$ then
 4:     $p \leftarrow getRandomValue(0, 1)$
 5:     if $p < P^i_{ask}(s_i)$ then
 6:       broadcast state $s_i$ and the confidence of the Q-values derived by $\Phi_i$
 7:       for each of the other agents do
 8:         add the shared Q-value to collection $\Pi$
 9:       if $\Pi \neq \emptyset$ then
10:         $b^i_{ask} \leftarrow b^i_{ask} - 1$
11:         $\Pi = \Gamma(\Pi)$    ▷ each action is associated with one Q-value
12:         for each state-action pair of $\Pi$ do
13:           denote by $Q_j(s_i, a)$ the Q-value for state-action pair $(s_i, a)$
14:           $Q_i(s_i, a) \leftarrow Q_j(s_i, a)$
15:         execute the greedy exploration strategy
16:   if no action is executed then
17:     perform the usual exploration strategy (i.e., $\epsilon$-greedy)

Algorithm 2 Share the maximum Q-value for a given state
Require: agent $j$, function $\Psi_j$, budget $b^j_{give}$, state $s_i$, and function $\Phi_i$.
 1: switch from the current state $s_j$ to state $s_i$
 2: if $b^j_{give} > 0$ then
 3:   $a^*_j \leftarrow \arg\max_a Q_j(s_i, a)$
 4:   if $\Psi_j(s_i, a^*_j) > \Phi_i(s_i, a^*_j)$ then
 5:     $b^j_{give} \leftarrow b^j_{give} - 1$
 6:     return $a^*_j$, $Q_j(s_i, a^*_j)$ and $\Psi_j(s_i, a^*_j)$
 7: switch from state $s_i$ back to state $s_j$

Algorithm 2 describes when agent $j$ provides its maximum Q-value to partaker $i$ for state $s_i$. As long as budget $b^j_{give}$ has not been used up, agent $j$ compares its confidence in the maximum Q-value for $s_i$ with that of the partaker (lines 1-3). Agent $j$ shares its maximum Q-value if it has higher confidence in that Q-value than partaker $i$; if agent $j$ takes the role of sharer, budget $b^j_{give}$ is decremented by 1 (lines 4-5). Sharer $j$ then sends its maximum Q-value in state $s_i$, the associated action, and the confidence of that Q-value to partaker $i$ (line 6). After that, the partaker's action selection in state $s_i$ is guided by the sharer.
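The sketch below stitches Algorithms 1 and 2 together for tabular Q-learners. The agent class, the synchronous call that stands in for the broadcast-and-timeout step, the random tie-breaking used for $\Gamma$, and the default parameters are illustrative assumptions rather than the authors' implementation; in particular, $\Gamma$ here picks a single answer instead of merging all answered state-action pairs.

```python
import math
import random
from collections import defaultdict

class PSAFAgent:
    def __init__(self, actions, v_a=0.2, b_ask=2500, b_give=2500):
        self.Q = defaultdict(float)            # Q[(state, action)]
        self.updates = defaultdict(int)        # m_visit[(state, action)], Eq. (10)
        self.state_visits = defaultdict(int)   # n_visit[state]
        self.actions, self.v_a = actions, v_a
        self.b_ask, self.b_give = b_ask, b_give

    # ---- partaker side (Algorithm 1) ----
    def act(self, state, others, epsilon=0.1):
        self.state_visits[state] += 1
        p_ask = (1 + self.v_a) ** (-math.sqrt(self.state_visits[state]))  # Eq. (7)
        if self.b_ask > 0 and random.random() < p_ask:
            # Synchronous stand-in for "broadcast and wait for a timeout".
            answers = [o.share(state, self.updates) for o in others]
            answers = [ans for ans in answers if ans is not None]
            if answers:
                self.b_ask -= 1
                action, q_value, _conf = random.choice(answers)   # Gamma: random pick
                self.Q[(state, action)] = q_value                 # replace local Q-value
                return max(self.actions, key=lambda a: self.Q[(state, a)])  # greedy
        # Otherwise: the usual epsilon-greedy exploration.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    # ---- sharer side (Algorithm 2) ----
    def share(self, state, partaker_updates):
        if self.b_give <= 0:
            return None
        a_star = max(self.actions, key=lambda a: self.Q[(state, a)])
        row = [self.Q[(state, a)] for a in self.actions]
        spread = max(row) - min(row)
        psi = self.updates[(state, a_star)] * (spread / (spread + 1.0))   # Eq. (11)
        if psi > partaker_updates[(state, a_star)]:                       # vs. Eq. (10)
            self.b_give -= 1
            return a_star, self.Q[(state, a_star)], psi
        return None
```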
5 Experiments

We are interested in a multiagent system where several agents cooperatively solve a task. Each agent independently observes the environment and learns its own Q-values. The advising framework [5] that uses actions as advice achieves excellent results in a setting similar to ours, and has been shown to surpass sharing a successful episode under a budget constraint. We compare PSAF with the following approaches in our experiments.
a. Multi-IQL [11]: Each agent is an independent learner and learns an individual Q-function. There is no communication among the agents. Multi-IQL serves as a baseline to validate the benefit of advising actions and sharing Q-values.
b. AdhocTD [5]: AdhocTD is a state-of-the-art action-advising method for multiagent learning; the details of this framework are given in Section 3.3. We compare PSAF with AdhocTD since we argue that sharing Q-values is the most effective way to promote the learning of Q-learners.
c. AdhocTD-Q: We adapt AdhocTD in a straightforward way so that agents share Q-values with one another. In AdhocTD-Q, agents use the asking function $P_{ask}$ and the giving function $P_{give}$ defined by AdhocTD to ask for Q-values and to share their maximum Q-values, respectively.

We evaluate PSAF, AdhocTD, AdhocTD-Q, and Multi-IQL in three cooperative games. The Predator-Prey domain is a popular benchmark for multiagent learning, Half Field Offense is a more complex robot soccer game, and the Spread game is a quickly deployable domain. Since the current experiments involve only a few interacting agents, we adopt the random selection strategy (function $\Gamma$) for each partaker in AdhocTD-Q and PSAF to utilize the shared Q-values. In all methods (except Multi-IQL), the numbers of times that an agent asks for and provides Q-values (or actions) are limited by budgets $b_{ask}$ and $b_{give}$, respectively. When we report that the difference between two curves is significant, we mean that we have at least 95% confidence that one curve has a larger area, using t-tests on the areas with $\alpha = 0.05$.

Figure 1: (a) Left: the Predator-Prey domain with four predators and one prey; the red agents are predators and the blue one is the prey. (b) Right: the prey is caught; the four predators are next to the prey.

The Predator-Prey (PP) domain is easy to implement and customize, and its results are easy to interpret. It has therefore been used extensively to evaluate multiagent learning algorithms [35, 36, 37]. Our implementation uses an $N \times N$ grid world, where $N$ is the number of cells in the $x$ ($y$) direction. As shown in Figure 1a, there are four predators and one prey in the grid world. Each of them occupies one cell, and a cell may be occupied by only one agent, to avoid deadlocks. The four predators and the prey can choose among five actions: Stay, Go Up, Go Down, Go Left, and
Go Right. By executing an action, an agent moves one cell in the corresponding direction. In this game, the prey takes a random action part of the time and moves away from all predators the rest of the time, making the task harder than with a fully random prey. The four predators are reinforcement learning agents; they learn to cooperatively catch the prey as quickly as possible. Each predator fully observes the relative $x$ and $y$ coordinates of the other predators and the prey. All state values are normalised to $[-1, 1]$ by dividing by the number of cells $N$. The prey is caught only when the four predators are next to it in the four cardinal directions, as shown in Figure 1b. If the predators catch the prey, all predators receive a reward of 1, and otherwise 0. Each predator (when not sharing actions or Q-values) is equipped with Q($\lambda$), where $\lambda$=0.9, $\gamma$=0.9, and $\alpha$=0.1. $\epsilon$-greedy with $\epsilon$=0.1 is used as the exploration strategy for the four predators. Since all predators are learning and changing their policies, it is very hard for them to learn with Q($\lambda$). Tile coding [38, 32] is used to force a generalization over the state space, with 8 tilings and a tile width of 0.5.

The PP domain has one popular metric for performance evaluation: Time to Goal (TG), the number of steps that the predators take to catch the prey. Lower TG values mean that the predators catch the prey more quickly. An episode starts when the four predators and the prey are initialized at random positions in the grid world, and ends when either the predators catch the prey or a time limit is exceeded. In each episode, the maximum number of steps that the predators and the prey can play is 2,500. The four predators are trained for 10,000 episodes. After every 100 training episodes, the TG values of these episodes are gathered and averaged to obtain more stable values. This evaluation process is computationally efficient and clearly presents the trend of TG values for the different methods. We thereby obtain one run containing all evaluated TG values for the 10,000 training episodes, and the process is repeated 100 times. For the parameters $v_a$ and $v_b$ of AdhocTD and AdhocTD-Q, and $v_a$ of PSAF, we first tune $v_a$ and $v_b$ for AdhocTD with an unlimited budget, so as to obtain significantly lower TG values than Multi-IQL without spending a very high budget, and we then apply these parameters to all other experiments. We finally choose $v_a$=0.2 and $v_b$=1 for AdhocTD, AdhocTD-Q and PSAF.

In order to explore the effect of different amounts of budget on the performance of each method, we consider three cases: (1) there are unlimited budgets, and we set $b_{ask} = b_{give} = +\infty$; (2) there are limited budgets, and we set $b_{ask} = b_{give} = 2,500$; and (3) there are limited budgets, and we set $b_{ask} = b_{give} = \ldots$
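As a small illustration of the catch condition and shared reward described above, the following sketch checks whether the four predators occupy the prey's four cardinal neighbours; the coordinate conventions and function names are our own assumptions.

```python
def prey_caught(prey, predators):
    """The prey is caught only when all four cardinal neighbours of its cell
    are occupied by predators."""
    x, y = prey
    required = {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}
    return required == set(predators)

def team_reward(prey, predators):
    """All predators receive the same reward: 1 on a catch, 0 otherwise."""
    return 1.0 if prey_caught(prey, predators) else 0.0

# Example: the configuration of Figure 1(b).
print(team_reward((2, 2), [(1, 2), (3, 2), (2, 1), (2, 3)]))  # -> 1.0
```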
[Figure: "Four Predators Catch One Prey": Time to Goal vs. training episodes (x100) for PSAF, AdhocTD, AdhocTD-Q and Multi-IQL. Panels: (a) TG of 10,000 episodes with b=+∞; (b) TG between 50 and 200 with b=+∞; (c) TG of 10,000 episodes with b=2,500; (d) TG between 50 and 200 with b=2,500.]