A Q-values Sharing Framework for Multiagent Reinforcement Learning under Budget Constraint
Changxi Zhu
South China University of Technology [email protected]
Ho-fung Leung
The Chinese University of Hong Kong [email protected]
Shuyue Hu
National University of Singapore [email protected]
Yi Cai
South China University of Technology [email protected]
December 1, 2020

Abstract
In the teacher-student framework, a more experienced agent (teacher) helps accelerate the learning of another agent (student) by suggesting actions to take in certain states. In cooperative multiagent reinforcement learning (MARL), where agents need to cooperate with one another, a student may fail to cooperate well with others even by following the teachers' suggested actions, as the policies of all agents keep changing before convergence. When the number of times that agents communicate with one another is limited (i.e., there is a budget constraint), an advising strategy that uses actions as advice may not be good enough. We propose a partaker-sharer advising framework (PSAF) for cooperative MARL agents learning under a budget constraint. In PSAF, each Q-learner can decide when to ask for Q-values and when to share its Q-values. We perform experiments in three typical multiagent learning problems. Evaluation results show that our approach PSAF outperforms existing advising methods under both unlimited and limited budgets, and we give an analysis of the impact of advising actions and sharing Q-values on agents' learning.
Keywords: multiagent reinforcement learning · cooperative learning · Q-learner · knowledge sharing

1 Introduction

Many real-world tasks, e.g., robotic games [1] and distributed power allocation [2], involve a set of agents in a common environment. By learning from trial and error, an agent adapts its behavior in an unfamiliar environment; this is the popular mechanism of Reinforcement Learning (RL). However, RL methods require a long period of interaction with the environment, resulting in limited scalability and applicability in complex situations. When RL is used in a Multiagent System
(MAS), a setting also referred to as Multiagent Reinforcement Learning (MARL), these problems are further intensified, since all agents are changing and adapting their behaviors [3], thereby increasing the amount of interaction needed. To know whether an action is optimal in MARL, an agent must execute the action, often many times. When the computing resources of agents are limited, the time they spend should be used more efficiently over their entire lifetime.

One natural solution to speed up the learning process of RL agents is knowledge sharing, which relies on communicating various pieces of external knowledge, such as policies, value functions and episodes (state, action, quality triplets), among agents. When an agent knows little about the current environment, it can undoubtedly reduce its exploration time by reusing knowledge from other agents who have acquired some skills in advance. Three problems arise when agents communicate with one another: (1) deciding what kind of knowledge to share; (2) deciding when to share knowledge, especially when communication is limited due to cost; and (3) deciding how to integrate received knowledge. Recently, one notable approach is the teacher-student framework, which focuses on
Action Advising [4, 5]. In this framework, a more experienced agent or human expert (teacher) aids the learning of another agent (student) by suggesting actions to take in certain states. Advising opportunities are established jointly by the teacher and the student, which is suitable for the case of limited communication. In order to model the communication cost, the numbers of times that a student asks for action advice and that a teacher provides advice are each constrained by a budget, i.e., a budget constraint. The framework has achieved promising results on both single-agent and multi-agent problems. In this paper, we consider multiple RL agents that cooperatively try to solve a task: they learn to cooperate with each other by coherently choosing their individual actions based on their own observations, such that the resulting joint action is optimal. Each agent learns to act optimally by maximising its Q-function, which gives the expected cumulative discounted reward for every state. In this setting, however, a student may fail to cooperate well with others even by following a teacher's suggested actions, since all other agents are learning and adjusting their own policies as training progresses. More importantly, it is still hard for the student to learn, specifically, the Q-values corresponding to the advised actions. The problem of applying the teacher-student framework in cooperative MARL is that advising actions to a student does not inherently change the student's policy; the student needs a long period to adapt its behaviour to other agents in the changing environment.

Sharing knowledge is of utmost importance for cooperative agents learning in a shared environment. In such a case, if they are equipped with the same learning structure, they most likely become interchangeable, i.e., they have identical optimal policies and value functions. Without any knowledge mapping, an agent can directly use knowledge, e.g., Q-values, from more experienced agents as its own. If a student is able to choose its next action based on its teachers' Q-values in the current state, then it can act without having to take more time to learn for that state. Using the Q-values from a teacher requires that the agents involved in the advising relation are reinforcement learners and have similar (or even the same) reward functions. Although this may sacrifice flexibility compared with action advising, we argue that sharing Q-values is more effective than advising actions for a cooperative team of RL agents. The Q-values from other agents guide an agent's exploitation of the team's currently learned Q-values, rather than leaving it to learn the optimal Q-function entirely by itself. Another critical consideration in our work is that sharing Q-values among all agents at every time step takes time and can be costly, especially for multiagent systems composed of a large population of agents, such as in the field of social learning [6]. Similar to the advising budget in the
teacher-student framework, we assume that each agent is constrained by two budgets, one for asking for and one for giving Q-values. It is therefore vital to decide the proper time to share Q-values.

In this paper, we present a partaker-sharer advising framework (PSAF) for cooperative MARL agents to share Q-values under a budget constraint, where only a limited number of Q-values can be shared over the whole learning process. In PSAF, a more experienced agent (sharer) can provide another agent (partaker) with its maximum Q-value for the state of the partaker, which indicates the sharer's best action. Each agent can play the role of partaker or sharer in different sharing processes. An agent decides when to start asking for Q-values according to how many times it has explored its current state. That is, if it has visited the current state very few times, it can initiate a sharing process, take the role of partaker, and send a request to all other agents. In order to avoid sharing Q-values that are even worse than the partaker's, another agent evaluates how confident it is in the partaker's state. If that agent has updated its maximum Q-value many more times than the partaker, it can join the partaker's sharing process, take the role of sharer, and share its maximum Q-value. Each agent has two numeric budgets, for requesting and providing Q-values respectively.

Our contributions in this paper are threefold. First, we propose a novel Q-values sharing framework, PSAF, for cooperative MARL agents learning under a budget constraint. Since agents can only share a limited number of Q-values, each agent in PSAF should decide when to ask for Q-values and when to share its Q-values, and how to integrate the shared Q-values into its own Q-function. Second, we highlight that the amount of budget is essential for agents advising actions as well as sharing Q-values, and introduce a new metric for the sharer to decide sharing opportunities, in order to use the budget more efficiently. We also explore the effect of different amounts of budget on the performance of all compared methods. Third, we show that our proposed framework PSAF significantly accelerates the learning of agents in three typical multiagent learning problems: the Predator-Prey domain, Half Field Offense, and the Spread game. We examine when agents share Q-values and actions. The distribution of sharing opportunities shows that in PSAF, most Q-values are shared when a partaker has updated the Q-values for the corresponding state-action pairs very few times while a sharer has updated them many times. Besides, we compare and contrast the effects of sharing Q-values and advising actions on agents' learning, which has not been investigated in previous works.

The remainder of this article is organized as follows. Section 2 describes related work on sharing Q-values and the teacher-student framework. In Section 3, we present background on multiagent reinforcement learning, temporal difference learning algorithms with eligibility traces, and a jointly-initiated advising framework. Section 4 formally presents our method PSAF, including when to ask for and give Q-values, and how to use the received Q-values. In Section 5, we conduct empirical demonstrations in the Predator-Prey domain, Half Field Offense and the Spread game, showing that our approach outperforms existing advising methods under unlimited and limited budgets. In Section 6, we give an analysis of advising actions and sharing Q-values, and discuss which situations are suitable for PSAF.
Finally, in Section 7, we draw the conclusions of this article and propose future directions.
2 Related Work

The idea of sharing Q-values has its roots in multiagent reinforcement learning. The Q-values of reinforcement learners are the expected cumulative discounted rewards gained by performing actions in states during learning. Sharing Q-values reduces the time needed to learn which actions are rewarded. The
centralized learning method [7] learns a shared Q-function over the actions and observations of all agents in an environment. Both the state and action spaces scale exponentially with the number of agents, rendering this approach infeasible for thousands of agents learning together. By contrast, we consider agents that are decentralized in both training and execution, which makes it convenient for agents to join and leave [8, 9, 10]. Concretely, each agent is equipped with a separate processor and memory to update and store its own policy. One simple method without any communication is to use independent learners [11, 8], where each agent individually learns its policy or Q-function based on its own observations. However, it is nontrivial to determine how decentralized agents can benefit from sharing Q-values with one another in order to accelerate the learning process.

The first line of work on sharing Q-values assumes that the Q-values being shared have been prepared before the current learning process. One typical approach is transfer learning, especially transferring Q-functions among different tasks [12, 13], whereby an agent can learn faster in a target task after training on a less complex task. Even though this method leads to a dramatic speedup in learning, it requires building hand-coded relationships between action-value functions, for example, mappings between the different state representations and reward functions of different tasks. When several agents learn to cooperate in a multiagent system, they may need to coordinate in a few states while acting independently in the others. Using this idea, Kok and Vlassis [14] propose Sparse Tabular Multiagent Q-learning, which maintains a predefined list of states in which coordination is necessary. Only in these states do agents perform joint actions and update a shared Q-table; they learn their own Q-values in all remaining states. In order to automatically identify coordinating states, De Hauwere et al. propose CQ-learning [15] and FCQ-learning [16]. Both methods assume that an agent has already learned its optimal policy alone before learning together with other agents. When an agent acts in the multiagent environment, a state is marked as a conflict state if the agent detects a change in the received immediate reward compared with the expected reward derived from its optimal single-agent policy. It then incorporates its previously acquired Q-values from the single-agent environment into the current joint Q-function; otherwise, the previous Q-values are used to select the next actions. However, these works assume that agents have already learnt (or are able to learn) some kind of single-agent knowledge (e.g., a local value function) before the multi-agent learning process. By comparison, we consider multiple agents that start learning with no previous knowledge in a shared environment, so they must rely on their partners, who may have explored different regions of the state space. Tan [17] points out that cooperative Q-learning agents can communicate several aspects of their learning process for the current task, such as policies (or complete Q-tables), episodes (state, action, quality triplets), and sensations. The Q-values for every state-action pair of all agents are averaged at a predefined time interval, and the averaged Q-values are then assigned to each agent. Nevertheless, if agents communicate over wireless channels such as Wi-Fi and LTE, sharing whole Q-tables may incur a high communication cost, especially when agents learn in a large-scale multiagent system.
One of the works most similar to ours is Parallel Transfer Learning (PTL) [18, 19], which enables multiple agents to transfer their Q-values while they are simultaneously learning in different tasks or in the same task. In order for knowledge sharing to be beneficial, the Q-values to be shared are carefully selected. When all agents learn together in a common environment, a fixed number of Q-values is transferred at every time step according to the importance of the associated states. In PTL, agents must be constantly connected with each other in order to share their Q-values. In comparison, we seek to have agents communicate with each other only when necessary, and an agent is able to decide in which situations it shares its Q-value when requested. The major merit of our framework is its
flexibility and applicability to a large number of agents learning cooperatively under a budget constraint.

There are many approaches that address how to use Q-values received from other agents, even though they generally ignore the problem of when to share Q-values, which we emphasize in this work. These methods can be classified into: (a) Average-Q; (b) Weighted-Q; (c) Replace-Q. The Average-Q method, also known as Policy Averaging [17], is the simplest way to utilize shared Q-values. It assumes that all agents contribute equally to the corresponding state-action pairs: when an agent communicates with other agents, its own Q-values and the shared Q-values are averaged to produce the new Q-values for the corresponding state-action pairs. Since agents may have experienced different areas of the state-action space at a given time step, some outdated Q-values are taken into account in the Average-Q method, so an agent needs to combine the shared Q-values more carefully. The Weighted-Q method allows a weighted combination of received and local Q-values. Ahmadabadi and Asadpour [20] propose several weighting strategies to generate new Q-values from an agent's current Q-values and the received Q-values: each agent measures the expertness of its teammates, assigns a weight to their knowledge, and learns from them accordingly. In PTL [19], as the number of times that an agent visits a state increases, locally learned Q-values are weighted more heavily than the received Q-values. However, when an agent knows nothing or very little about a particular state, incorporating its Q-values may sharply decrease the new Q-values, which makes the learning process unstable. In the Replace-Q method, an agent fully accepts the shared knowledge, as it has no better information. A more complex case arises when the received Q-values come from multiple sources and the agent must select one of them. Experience counting, inspired by the non-trivial update counting method in [21], is another popular sharing strategy [22]. It maintains an experience table, in addition to the Q-function, to keep track of which states and actions an agent has actually visited during learning. The central idea behind this method is that the most experienced agent should contribute most to the Q-value of the corresponding state-action pair; at every time step, all agents use, for each state-action pair, the Q-value of the agent with the most experience. With a tabular representation of the Q-function, which is perhaps the most common in RL tasks, the new Q-values generated by Average-Q, Weighted-Q and Replace-Q can be directly assigned to the corresponding state-action pairs. Another possible way of absorbing shared Q-values is to update local Q-values with a Q-learning-like rule, as in CQ-learning and FCQ-learning: after an action is performed in the current state, the Q-values from other knowledge sources for the next state are used to bootstrap the Q-values of the current state. This method does not need direct access to an agent's Q-function (e.g., its Q-table), so it can be extended to problems with continuous state or action spaces.
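To make the three combination rules above concrete, the following minimal Python sketch shows one way they could be implemented for a single state's Q-value vector. The visit-count weighting in weighted_q is an illustrative stand-in for the expertness measures of [20] and PTL, not their exact formulas, and all function and variable names are our own.

```python
import numpy as np

def average_q(local_q, shared_qs):
    """Average-Q: treat all agents' estimates for this state as equally reliable."""
    return np.mean([local_q] + list(shared_qs), axis=0)

def weighted_q(local_q, shared_q, n_local_visits, n_shared_visits):
    """Weighted-Q (illustrative): weight each estimate by how often its owner
    has updated this state, so more experienced agents contribute more."""
    w_local = n_local_visits / (n_local_visits + n_shared_visits + 1e-8)
    return w_local * local_q + (1.0 - w_local) * shared_q

def replace_q(local_q, shared_qs, experience_counts):
    """Replace-Q with experience counting: adopt the estimate of the most
    experienced agent (index 0 is the local agent)."""
    candidates = [local_q] + list(shared_qs)
    best = int(np.argmax(experience_counts))
    return candidates[best]

# Example: Q-value vectors over 5 actions for one state.
local = np.array([0.1, 0.0, 0.3, 0.0, 0.0])
shared = np.array([0.5, 0.2, 0.1, 0.0, 0.4])
print(average_q(local, [shared]))
print(weighted_q(local, shared, n_local_visits=3, n_shared_visits=40))
print(replace_q(local, [shared], experience_counts=[3, 40]))
```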
Recently, the teacher-student framework [4] has received a lot of attention [5, 23, 24, 13]. In this framework, a more experienced agent (teacher) accelerates the learning process of another agent (student) by providing advice on which action to take. The student updates its policy based on rewards received from the environment, while its exploration is guided by the teacher's advice. Some works, such as the supervision framework designed for organizational structures [25] or supervising reinforcement learning [26], propose to supervise a network of agents by providing rules (forbidden actions in certain states), suggestions (preferred actions in certain states), and so on. However, all supervisors are fixed, and the agents supervised by a supervisor need to communicate with it at every time step. By contrast, the most significant aspect of the teacher-
student framework is that advising opportunities are determined only when required, considering practical concerns regarding attention and communication. This setting is also widely adopted by many extensions of the teacher-student framework, as described in [27, 28, 5]. Moreover, both the multiagent advising framework [5] and our work assume no fixed role for each agent, which means each agent has the opportunity to be a teacher (or sharer). There are three modes for deciding advising opportunities: student-initiated [29], teacher-initiated [4] and jointly-initiated [30]. In the student-initiated mode, a student who learns to act optimally in a task is assisted by an expert teacher whenever the student's confidence in a state is low; sharing decisions made by the student are likely to be weak, since the student itself is still learning. Conversely, in the teacher-initiated mode, a trained teacher decides when to give advice to a student. This mode requires the student's current state to always be communicated to the teacher, and the teacher must constantly pay attention to the student's learning; transmitting every state that the student experiences to the teacher can incur a prohibitive communication cost. The jointly-initiated approach determines advising opportunities under the agreement of both teacher and student. In this mode, the teacher is not required to constantly monitor the student; the relation between a student and a teacher is established on demand, which is particularly suitable for the case of limited communication. The work of Silva et al. [5] is the first to apply the teacher-student framework to a multiagent system composed of multiple simultaneously learning agents. In order to overcome the challenge of having no fixed teacher and student roles, they extend the heuristics from [4] to measure an agent's confidence in a given state based on the number of times it has visited the state. The numbers of times that a student asks for advice and that a teacher gives action advice are limited by two budgets respectively. Under this budget constraint, their approach achieves state-of-the-art results, even compared with sharing whole episodes during learning. Despite these promising results, an advising strategy that uses actions as advice may not be good enough to accelerate the overall learning process of Q-learners.
3 Background

We are interested in a cooperative multi-agent setting where agents receive local observations and learn in a decentralised fashion. All communication among agents must be explicitly specified. The learning problem of multiple decentralised agents with local observations is generally modelled as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP) [31], which is an extension of the Markov Decision Process (MDP) [32]. A Dec-POMDP is defined by a tuple $\langle I, S, A, T, R, \Omega, O, \gamma \rangle$, where $I$ is the set of $n$ agents, $S$ is the set of environment states, $A = \times_{i \in I} A_i$ is the set of joint actions, $T$ is the state transition probability function, $R$ is the reward function, $\Omega = \times_{i \in I} \Omega_i$ is the set of joint observations, $O$ is the set of conditional observation probabilities, and $\gamma \in [0, 1]$ is the discount factor. At every time step $t$ in an environment state $s'$, each agent $i$ perceives its own observation $o^i_t$ from the joint observation $o = \langle o^1, \ldots, o^n \rangle$ determined by $O(o \mid s', a)$, where $a = \langle a^1, \ldots, a^n \rangle$ is the joint action that causes the state transition from $s$ to $s'$ according to $T(s' \mid a, s)$, and receives reward $r^i$ determined by $R(s, a)$. We focus on cooperative Multiagent Reinforcement Learning (MARL), where several reinforcement learning agents jointly affect the environment and receive the same reward ($r^1 = r^2 = \ldots = r^n$). The observability of agents in Dec-POMDPs has been elaborately discussed by Oliehoek and Amato [31]. In our framework, we assume that agents
are able to observe each other at any time and infer each other's local observations. That is, the individual observation (partial view) of each agent always uniquely identifies the environment state. Since the agents are distributed, it is still difficult to learn the optimal policy for each of them [31].
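For concreteness, the Dec-POMDP tuple $\langle I, S, A, T, R, \Omega, O, \gamma \rangle$ can be bundled into a simple container as in the sketch below; the field names and callable signatures are our own illustrative choices, not an interface defined in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = int
JointAction = Tuple[int, ...]   # one action per agent
JointObs = Tuple[int, ...]      # one observation per agent

@dataclass
class DecPOMDP:
    """Container for the tuple <I, S, A, T, R, Omega, O, gamma>."""
    agents: List[int]                                             # I
    states: List[State]                                           # S
    joint_actions: List[JointAction]                              # A = x_i A_i
    transition: Callable[[State, JointAction, State], float]      # T(s' | s, a)
    reward: Callable[[State, JointAction], float]                 # R(s, a), shared by all agents
    joint_observations: List[JointObs]                            # Omega = x_i Omega_i
    observation: Callable[[JointObs, State, JointAction], float]  # O(o | s', a)
    gamma: float                                                  # discount factor in [0, 1]
```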
Temporal difference (TD) learning algorithms such as Q-learning [33] and SARSA [32] learn an action-value function $Q(s, a)$, which is an estimate of the expected cumulative discounted reward obtained when an agent takes action $a$ in state $s$. In our work, each agent individually receives its own state from a shared environment and learns a Q-function. (In this paper, the local observation of an agent is also referred to as the agent's own state; the former term takes a multi-agent view while the latter takes a single-agent view.) Learning a policy for an agent $i$ means better estimating its Q-function $Q_i(s_i, a_i)$ for each action $a_i$ in its own state $s_i$. Since all agents have the same reward function (equivalent to a joint reward function), their Q-functions are also the same when no explicit specialization is imposed, e.g., different learning structures. We define the experience of agent $i$ at time step $t$ as a tuple $\langle s^i_t, a^i_t, s^i_{t+1}, r^i_{t+1} \rangle$, where state $s^i_{t+1}$ is reached after the agent executes action $a^i_t$ in its state $s^i_t$, and reward $r^i_{t+1}$ is received. The Q-function of agent $i$ is incrementally updated based on the agent's experience gained during learning, using a weighted average of the old value and the new information. The update of the Q-value for each state-action pair is defined as follows:

$$Q^i_t(s^i_t, a^i_t) \leftarrow Q^i_t(s^i_t, a^i_t) + \alpha \delta \quad (1)$$

where $\alpha \in [0, 1]$ is the learning rate and $\delta$ is the TD error. In Q-learning, we have:

$$\delta = r^i_{t+1} + \gamma \max_a Q^i_t(s^i_{t+1}, a) - Q^i_t(s^i_t, a^i_t) \quad (2)$$

where $\gamma \in [0, 1]$ is the discount factor. RL agents must choose between focusing on high-reward actions and taking actions with the intent of exploring the environment. We consider agents that adopt $\epsilon$-greedy action selection: at every time step, with a large probability $1 - \epsilon$ an agent takes the action with the highest Q-value in the current state, and with a small probability $\epsilon$ it takes a random action. Thus, in normal learning (without asking for advice), a Q-learning agent chooses the action to be executed at every learning step according to $\epsilon$-greedy. SARSA, another popular RL algorithm, defines the TD error as follows:

$$\delta = r^i_{t+1} + \gamma Q^i_t(s^i_{t+1}, a^i_{t+1}) - Q^i_t(s^i_t, a^i_t) \quad (3)$$

where $a^i_{t+1}$ is the next action that agent $i$ will execute in $s^i_{t+1}$ according to a defined exploration strategy such as $\epsilon$-greedy. These algorithms are guaranteed to converge to the optimal Q-function $Q^*_i$, from which the optimal policy $\pi^*_i$ of agent $i$ can be derived:

$$\pi^*_i(s_i) = \arg\max_a Q^*_i(s_i, a) \quad (4)$$
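The following minimal tabular sketch illustrates the updates in Eqs. (1)-(4) together with $\epsilon$-greedy action selection; the dictionary-based Q-table, the action set, and the particular constants are illustrative assumptions, not values taken from this paper.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3, 4]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def epsilon_greedy(state):
    """With probability 1 - epsilon take the greedy action, otherwise explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # Eq. (4) on the current estimate

def q_learning_update(s, a, r, s_next):
    """Eq. (2): off-policy TD error uses the max over next-state actions."""
    td_error = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error          # Eq. (1)

def sarsa_update(s, a, r, s_next, a_next):
    """Eq. (3): on-policy TD error uses the action actually chosen next."""
    td_error = r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error
```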
In order to speed up reinforcement learning, instead of updating one Q-value at every time step, an n-step backup can be used. We adopt eligibility traces [32] to further enhance the performance of TD algorithms. In this strategy, each agent records the state-action pairs that have been recently visited, and the TD error at the current time step is used to update the Q-values of these state-action pairs. Eligibility traces thus assign credit or blame to the eligible states or actions. At time step $t$ in an episode, agent $i$ updates its accumulating traces for the states that have been visited and the actions that have been taken so far by the following rule:

$$e^i_t(s_i, a_i) = \begin{cases} \gamma \lambda e^i_{t-1}(s_i, a_i), & \text{if } s_i \neq s^i_t \\ \gamma \lambda e^i_{t-1}(s_i, a_i) + 1, & \text{if } s_i = s^i_t \end{cases} \quad (5)$$

where $\lambda \in [0, 1]$ is the trace decay rate. Q($\lambda$) and SARSA($\lambda$) are the extensions of Q-learning and SARSA, respectively. In Q($\lambda$) or SARSA($\lambda$), an agent uses TD errors weighted by the eligibility traces to update the Q-values of every state-action pair experienced in the current episode. Eq. 1 is then modified as follows:

$$Q_i(s_i, a_i) \leftarrow Q_i(s_i, a_i) + \alpha \delta e^i_t(s_i, a_i) \quad (6)$$

Recent works on the teacher-student framework mainly adopt the jointly-initiated method to build advising relations, which is more appropriate for learning under a budget constraint. Here we introduce the multi-agent advising framework AdhocTD [5] to illustrate how advising relations (for action advice) are constructed only when necessary. This framework has shown promising results in the complex stochastic environment Half Field Offense [34], and is considered the state-of-the-art approach. In AdhocTD, at each time step, agent $a_i$ asks for advice with an asking probability $P_{ask}$ in its current state $s$. The asking probability is calculated as follows:

$$P_{ask}(s) = (1 + v_a)^{-\sqrt{n_{visit}(s)}} \quad (7)$$

where $v_a$ is a predetermined parameter and $n_{visit}(s)$ is the number of times that the agent has visited state $s$. Agent $a_j$ gives its best action with a giving probability $P_{give}$ for the state of agent $a_i$. The giving probability is calculated as follows:

$$P_{give}(s) = 1 - (1 + v_b)^{-\sqrt{n_{visit}(s)} \times I(s)} \quad (8)$$

where $v_b$ is a predetermined parameter, and $I(s)$ encodes the difference between agent $a_j$'s maximum and minimum Q-values in the requested state $s$:

$$I(s) = \max_a Q(s, a) - \min_a Q(s, a) \quad (9)$$

All agents are limited by a budget $b_{ask}$ to ask for advice and a budget $b_{give}$ to give advice.
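As a concrete reading of Eqs. (7)-(9), the sketch below computes the AdhocTD asking and giving probabilities from tabular visit counts and Q-values. Variable and function names are illustrative, and the placement of $I(s)$ in the exponent follows the formula as written above.

```python
import math

def p_ask(n_visits, v_a):
    """Eq. (7): probability of asking for advice in a state visited n_visits
    times; it decays quickly as experience grows."""
    return (1.0 + v_a) ** (-math.sqrt(n_visits))

def importance(q_values):
    """Eq. (9): spread between the best and worst Q-value in the requested state."""
    return max(q_values) - min(q_values)

def p_give(n_visits, q_values, v_b):
    """Eq. (8): probability that a potential teacher answers, increasing with
    both its own experience in the state and the importance of the state."""
    return 1.0 - (1.0 + v_b) ** (-math.sqrt(n_visits) * importance(q_values))

# Example: an inexperienced student usually asks ...
print(p_ask(n_visits=1, v_a=0.2))
# ... and an experienced teacher with well-separated Q-values usually answers.
print(p_give(n_visits=200, q_values=[0.9, 0.1, 0.0], v_b=1.0))
```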
4 Partaker-Sharer Advising Framework

Our approach aims to accelerate learning for a cooperative team of reinforcement learning agents in a multiagent environment. These agents are assumed to be homogeneous, meaning that they are interchangeable, i.e., they have identical optimal policies and value functions. During learning, each agent may have unique experience, or local knowledge (i.e., its Q-function), of how to perform effectively in the current task. The agents may not be willing to share their whole Q-functions with each other due to the budget constraint. However, we assume that the team of agents is willing to share a limited number of Q-values over its entire lifetime, which we model as a limited number of times that an agent can ask for and share Q-values. Specifically, each agent can decide to share its maximum Q-value for the current state of the agent who asks for Q-values. The maximum Q-values from other agents guide an agent's exploitation of the team's currently learned Q-values. The goal of our method is to determine when to ask for Q-values and share the maximum Q-values, as well as how to use the received maximum Q-values.

We propose a partaker-sharer advising framework (PSAF) for multiple decentralized Q-learners to share Q-values under a budget constraint. A sharing process is initiated by a partaker, the role taken by an agent who asks for Q-values in its own state. The other agents can take the role of sharer and share their maximum Q-values for the state of the partaker. The partaker then chooses its current action based on its own Q-values as well as the sharers' Q-values. Each agent can take different roles in different sharing processes, depending on whether it decides to ask for or share Q-values for particular states. In this way, an agent may ask for Q-values in its current situation while sharing its maximum Q-value for the requested state of another agent. The maximum numbers of times that an agent can take the roles of partaker and sharer are given by two numeric budgets $b_{ask}$ and $b_{give}$, respectively.

In PSAF, we consider that an agent is more likely to ask for Q-values if it knows very little about some states, since it is more beneficial for the agent to be guided by more experienced agents in those states. Several heuristics from [30, 5] can be used to decide when an agent takes the role of partaker; they rely on the range of Q-values for the requested state or on how many times the agent has visited the state. The range of Q-values can vary greatly at the beginning of training, so it may mislead an agent that tries to ask for Q-values precisely when we expect the agent to need more help. AdhocTD [5] defines an asking function $P_{ask}(s) = (1 + v_a)^{-\sqrt{n_{visit}(s)}}$ based on the number of times an agent has visited state $s$. The function outputs a high probability of requesting advice when the agent has visited state $s$ very few times, and its value declines rapidly as the agent gains more experience in that state. In this paper, we adopt the same function $P^i_{ask}(s_i)$ to allow an agent $i$ to determine when to initiate a sharing process and take the role of partaker in its current state $s_i$. After a sharing process has been successfully initiated, another agent $j$ (any agent except partaker $i$) needs to decide whether it will share its maximum Q-value with the partaker. Previous works on the teacher-student framework have proposed several methods (e.g., the function $P_{give}$ defined by AdhocTD) to help a teacher decide when to advise a student without accessing the student's current learning. However, if both student and teacher learn from scratch, i.e., their policies are generally non-optimal during learning, the teacher may be less helpful or may even give worse advice if it cannot evaluate how well the student has learned. In PSAF, during a sharing process, partaker $i$ tells the other agents how confident it is in its Q-values for its current state $s_i$. Sharing Q-values with low confidence may interfere with the learning of agents who take the role of partaker, and wastes communication. Therefore, the other agents join the sharing process (of partaker $i$) and share their Q-values only if they have higher confidence in those Q-values than the partaker. Intuitively, as training progresses, the more often an agent has updated its Q-value for a state-action pair, the more reliable that Q-value generally becomes.
We first propose a confidence function $\Phi_i : S \times A \rightarrow \mathbb{R}$, which encodes how confident partaker $i$ is in its currently learned Q-values for state $s_i$. Formally, we have:

$$\Phi_i(s_i, a) = m^i_{visit}(s_i, a) \quad (10)$$

where $m^i_{visit}(s_i, a)$ is the number of times that partaker $i$ has updated its Q-value for the state-action pair $(s_i, a)$. In order to share Q-values that are substantially better than the partaker's, and thereby use the budget more efficiently, another agent $j$ intends to share its own Q-values only if it has updated them many more times than the partaker. We therefore propose another confidence
function $\Psi_j : S \times A \rightarrow \mathbb{R}$, which represents the confidence of agent $j$ for the state-action pair $(s_i, a)$. $\Psi_j$ is expected to scale down the number of updates of the Q-value for $(s_i, a)$ to a proper value. We define $\Psi_j$ as follows:

$$\Psi_j(s_i, a) = m^j_{visit}(s_i, a) \times \xi_j \quad (11)$$

where $m^j_{visit}(s_i, a)$ is the number of times that agent $j$ has updated its Q-value for the state-action pair $(s_i, a)$, and $\xi_j \in [0, 1]$. $\xi_j$ determines how many times agent $j$ should have updated the Q-value for $(s_i, a)$, relative to partaker $i$, before agent $j$ can take the role of sharer. When $\xi_j$ is 0, agent $j$ cannot share its Q-values; when $\xi_j$ is 1, the agent shares its Q-values as long as it has updated them more times than partaker $i$. We now consider how to construct $\xi_j$. In a given state, if all Q-values are nearly the same, it does not matter which one is shared. If the maximum Q-value of agent $j$ in a state is much higher than its other Q-values, sharing the maximum Q-value is more meaningful for partaker $i$, since that Q-value may lead to a fine-grained action; agent $j$ should then be more likely to act as a sharer, even if it has not updated the maximum Q-value many times. For convenience, we define $\xi_j$ as the difference between the maximum and minimum Q-values of agent $j$ in state $s_i$, normalized to $[0, 1]$:

$$\xi_j(s_i) = \frac{\max_a Q_j(s_i, a) - \min_a Q_j(s_i, a)}{\max_a Q_j(s_i, a) - \min_a Q_j(s_i, a) + 1} \quad (12)$$

where $\max_a Q_j(s_i, a)$ and $\min_a Q_j(s_i, a)$ are the maximum and minimum Q-values of agent $j$ in state $s_i$, respectively. If the difference between $\max_a Q_j(s_i, a)$ and $\min_a Q_j(s_i, a)$ is close to 0, then $\xi_j$ is low and agent $j$ must have updated its maximum Q-value in state $s_i$ quite a number of times to ensure that the Q-value is reliable. If the difference is large, $\xi_j$ is high, and agent $j$ has more opportunities to share its maximum Q-value.
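A minimal sketch of the confidence functions in Eqs. (10)-(12), assuming tabular update counts and a list of Q-values per state; the helper names are our own.

```python
def phi(update_counts, state, action):
    """Eq. (10): the partaker's confidence is simply the number of times it
    has updated Q(state, action)."""
    return update_counts[(state, action)]

def xi(q_row):
    """Eq. (12): spread of the candidate sharer's Q-values in the requested
    state, normalised to [0, 1)."""
    spread = max(q_row) - min(q_row)
    return spread / (spread + 1.0)

def psi(update_counts, q_row, state, action):
    """Eq. (11): the sharer's confidence is its update count scaled by xi, so
    a large Q-value spread lowers the experience needed before sharing."""
    return update_counts[(state, action)] * xi(q_row)

# The sharer only answers when psi(...) > phi(...) for its best action (Algorithm 2).
```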
Algorithm 1 describes when agent $i$ asks for Q-values and how it uses the shared maximum Q-values. At every time step, the agent in its current state $s_i$ can take the role of partaker as long as budget $b^i_{ask}$ has not been used up (lines 1-3). With the probability given by the asking function $P^i_{ask}(s_i)$, partaker $i$ initiates a sharing process and broadcasts a request for Q-values to all other agents (lines 4-6). The broadcast message contains the partaker's state $s_i$ and its confidence in each Q-value for the current situation. Partaker $i$ then waits until a predefined timeout to collect answers. We denote by $\Pi$ the collection of Q-values received from all sharers for the state of the partaker (lines 7-8). If $\Pi$ is not empty, $b^i_{ask}$ is decremented by 1 (lines 9-10). Partaker $i$ may receive several Q-values for the same state-action pair in a sharing process, since some sharers may have the same best action according to their own Q-values. Moreover, a sharer cannot access other sharers' Q-functions before joining the partaker's sharing process, in order to avoid additional communication. In PSAF, the partaker can either randomly select one of the shared Q-values or choose the Q-value with the maximum confidence among them, which is specified by the function $\Gamma$ (line 11). Partaker $i$ then needs to integrate the Q-value selected by $\Gamma$ into its local Q-function for the current state $s_i$. In this paper, we assume a tabular representation of the Q-function. For each corresponding state-action pair, partaker $i$ replaces its original Q-value with the shared Q-value from $\Gamma$ (lines 12-14). Then, in order to exploit the Q-values shared by the team, the partaker executes its currently best action, i.e., the action with the currently maximum Q-value in state $s_i$ (line 15). If agent $i$ does not ask for Q-values or no Q-value is received, the agent follows its usual exploration strategy, i.e., $\epsilon$-greedy (lines 16-17).
Algorithm 1 Ask for Q-values and use the shared maximum Q-values in a given state
Require: agent $i$, budget $b^i_{ask}$, asking function $P^i_{ask}$, and function $\Phi_i$.
 1: for each time step do
 2:   let the current state be $s_i$
 3:   if $b^i_{ask} > 0$ then
 4:     $p \leftarrow getRandomValue(0, 1)$
 5:     if $p < P^i_{ask}(s_i)$ then
 6:       broadcast state $s_i$ and the confidence of the Q-values derived by $\Phi_i$
 7:       for each of the other agents do
 8:         add the shared Q-value to collection $\Pi$
 9:       if $\Pi \neq \emptyset$ then
10:         $b^i_{ask} \leftarrow b^i_{ask} - 1$
11:         $\Pi = \Gamma(\Pi)$    ▷ each action is associated with one Q-value
12:         for each state-action pair of $\Pi$ do
13:           denote by $Q_j(s_i, a)$ the Q-value for state-action pair $(s_i, a)$
14:           $Q_i(s_i, a) \leftarrow Q_j(s_i, a)$
15:         execute the greedy exploration strategy
16:   if no action is executed then
17:     perform the usual exploration strategy (i.e., $\epsilon$-greedy)

Algorithm 2 Share the maximum Q-value for a given state
Require: agent $j$, function $\Psi_j$, budget $b^j_{give}$, state $s_i$, and function $\Phi_i$.
 1: switch from the current state $s_j$ to state $s_i$
 2: if $b^j_{give} > 0$ then
 3:   $a^*_j \leftarrow \arg\max_a Q_j(s_i, a)$
 4:   if $\Psi_j(s_i, a^*_j) > \Phi_i(s_i, a^*_j)$ then
 5:     $b^j_{give} \leftarrow b^j_{give} - 1$
 6:     return $a^*_j$, $Q_j(s_i, a^*_j)$ and $\Psi_j(s_i, a^*_j)$
 7: switch from state $s_i$ back to state $s_j$

Algorithm 2 describes when agent $j$ provides its maximum Q-value to partaker $i$ for state $s_i$. As long as budget $b^j_{give}$ has not been used up, agent $j$ compares its confidence in the maximum Q-value for $s_i$ with that of the partaker (lines 1-3). Agent $j$ shares its maximum Q-value if it has higher confidence in that Q-value than partaker $i$; if agent $j$ takes the role of sharer, budget $b^j_{give}$ is decremented by 1 (lines 4-5). Sharer $j$ then sends its maximum Q-value in state $s_i$, the associated action, and the confidence of that Q-value to partaker $i$ (line 6). After that, the partaker's action selection in state $s_i$ is guided by the sharer.
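The sketch below stitches Algorithms 1 and 2 together for tabular Q-learners. The agent class, the synchronous call that stands in for the broadcast-and-timeout step, the random tie-breaking used for $\Gamma$, and the default parameters are illustrative assumptions rather than the authors' implementation; in particular, $\Gamma$ here picks a single answer instead of merging all answered state-action pairs.

```python
import math
import random
from collections import defaultdict

class PSAFAgent:
    def __init__(self, actions, v_a=0.2, b_ask=2500, b_give=2500):
        self.Q = defaultdict(float)            # Q[(state, action)]
        self.updates = defaultdict(int)        # m_visit[(state, action)], Eq. (10)
        self.state_visits = defaultdict(int)   # n_visit[state]
        self.actions, self.v_a = actions, v_a
        self.b_ask, self.b_give = b_ask, b_give

    # ---- partaker side (Algorithm 1) ----
    def act(self, state, others, epsilon=0.1):
        self.state_visits[state] += 1
        p_ask = (1 + self.v_a) ** (-math.sqrt(self.state_visits[state]))  # Eq. (7)
        if self.b_ask > 0 and random.random() < p_ask:
            # Synchronous stand-in for "broadcast and wait for a timeout".
            answers = [o.share(state, self.updates) for o in others]
            answers = [ans for ans in answers if ans is not None]
            if answers:
                self.b_ask -= 1
                action, q_value, _conf = random.choice(answers)   # Gamma: random pick
                self.Q[(state, action)] = q_value                 # replace local Q-value
                return max(self.actions, key=lambda a: self.Q[(state, a)])  # greedy
        # Otherwise: the usual epsilon-greedy exploration.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    # ---- sharer side (Algorithm 2) ----
    def share(self, state, partaker_updates):
        if self.b_give <= 0:
            return None
        a_star = max(self.actions, key=lambda a: self.Q[(state, a)])
        row = [self.Q[(state, a)] for a in self.actions]
        spread = max(row) - min(row)
        psi = self.updates[(state, a_star)] * (spread / (spread + 1.0))   # Eq. (11)
        if psi > partaker_updates[(state, a_star)]:                       # vs. Eq. (10)
            self.b_give -= 1
            return a_star, self.Q[(state, a_star)], psi
        return None
```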
5 Experiments

We are interested in a multiagent system where several agents cooperatively solve a task. Each agent independently observes the environment and learns its own Q-values. The advising framework [5] that uses actions as advice achieves excellent results in a setting similar to ours, and has been shown to surpass sharing a successful episode under a budget constraint. We compare PSAF with the following approaches in our experiments.
a. Multi-IQL [11]: Each agent is an independent learner and learns an individual Q-function. There is no communication among the agents. Multi-IQL serves as a baseline to validate the benefit of advising actions and sharing Q-values.
b. AdhocTD [5]: AdhocTD is a state-of-the-art action-advising method for multiagent learning; the details of this framework are given in Section 3.3. We compare PSAF with AdhocTD since we argue that sharing Q-values is the most effective way to promote the learning of Q-learners.
c. AdhocTD-Q: We adapt AdhocTD in a straightforward way so that agents share Q-values with one another. In AdhocTD-Q, agents use the asking function $P_{ask}$ and the giving function $P_{give}$ defined by AdhocTD to ask for Q-values and to share their maximum Q-values, respectively.

We evaluate PSAF, AdhocTD, AdhocTD-Q, and Multi-IQL in three cooperative games. The Predator-Prey domain is a popular benchmark for multiagent learning, Half Field Offense is a more complex robot soccer game, and the Spread game is a quickly deployable domain. Since the current experiments involve only a few interacting agents, we adopt the random selection strategy (function $\Gamma$) for each partaker in AdhocTD-Q and PSAF to utilize the shared Q-values. In all methods (except Multi-IQL), the numbers of times that an agent asks for and provides Q-values (or actions) are limited by budgets $b_{ask}$ and $b_{give}$, respectively. When we report that the difference between two curves is significant, we mean that we have at least 95% confidence that one curve has a larger area, using t-tests on the areas with $\alpha = 0.05$.

Figure 1: (a) Left: the Predator-Prey domain with four predators and one prey; the red agents are predators and the blue one is the prey. (b) Right: the prey is caught; the four predators are next to the prey.

The Predator-Prey (PP) domain is easy to implement and customize, and its results are easy to interpret. It has therefore been used extensively to evaluate multiagent learning algorithms [35, 36, 37]. Our implementation uses an $N \times N$ grid world, where $N$ is the number of cells in the $x$ ($y$) direction. As shown in Figure 1a, there are four predators and one prey in the grid world. Each of them occupies one cell, and a cell may be occupied by only one agent, to avoid deadlocks. The four predators and the prey can choose among five actions: Stay, Go Up, Go Down, Go Left, and
Go Right. By executing an action, an agent moves one cell in the corresponding direction. In this game, the prey takes a random action part of the time and moves away from all predators the rest of the time, making the task harder than with a fully random prey. The four predators are reinforcement learning agents; they learn to cooperatively catch the prey as quickly as possible. Each predator fully observes the relative $x$ and $y$ coordinates of the other predators and the prey. All state values are normalised to $[-1, 1]$ by dividing by the number of cells $N$. The prey is caught only when the four predators are next to it in the four cardinal directions, as shown in Figure 1b. If the predators catch the prey, all predators receive a reward of 1, and otherwise 0. Each predator (when not sharing actions or Q-values) is equipped with Q($\lambda$), where $\lambda$=0.9, $\gamma$=0.9, and $\alpha$=0.1. $\epsilon$-greedy with $\epsilon$=0.1 is used as the exploration strategy for the four predators. Since all predators are learning and changing their policies, it is very hard for them to learn with Q($\lambda$). Tile coding [38, 32] is used to force a generalization over the state space, with 8 tilings and a tile width of 0.5.

The PP domain has one popular metric for performance evaluation: Time to Goal (TG), the number of steps that the predators take to catch the prey. Lower TG values mean that the predators catch the prey more quickly. An episode starts when the four predators and the prey are initialized at random positions in the grid world, and ends when either the predators catch the prey or a time limit is exceeded. In each episode, the maximum number of steps that the predators and the prey can play is 2,500. The four predators are trained for 10,000 episodes. After every 100 training episodes, the TG values of these episodes are gathered and averaged to obtain more stable values. This evaluation process is computationally efficient and clearly presents the trend of TG values for the different methods. We thereby obtain one run containing all evaluated TG values for the 10,000 training episodes, and the process is repeated 100 times. For the parameters $v_a$ and $v_b$ of AdhocTD and AdhocTD-Q, and $v_a$ of PSAF, we first tune $v_a$ and $v_b$ for AdhocTD with an unlimited budget, so as to obtain significantly lower TG values than Multi-IQL without spending a very high budget, and we then apply these parameters to all other experiments. We finally choose $v_a$=0.2 and $v_b$=1 for AdhocTD, AdhocTD-Q and PSAF.

In order to explore the effect of different amounts of budget on the performance of each method, we consider three cases: (1) there are unlimited budgets, and we set $b_{ask} = b_{give} = +\infty$; (2) there are limited budgets, and we set $b_{ask} = b_{give} = 2,500$; and (3) there are limited budgets, and we set $b_{ask} = b_{give} = \ldots$
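As a small illustration of the catch condition and shared reward described above, the following sketch checks whether the four predators occupy the prey's four cardinal neighbours; the coordinate conventions and function names are our own assumptions.

```python
def prey_caught(prey, predators):
    """The prey is caught only when all four cardinal neighbours of its cell
    are occupied by predators."""
    x, y = prey
    required = {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}
    return required == set(predators)

def team_reward(prey, predators):
    """All predators receive the same reward: 1 on a catch, 0 otherwise."""
    return 1.0 if prey_caught(prey, predators) else 0.0

# Example: the configuration of Figure 1(b).
print(team_reward((2, 2), [(1, 2), (3, 2), (2, 1), (2, 3)]))  # -> 1.0
```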
[Figure: "Four Predators Catch One Prey": Time to Goal vs. training episodes (x100) for PSAF, AdhocTD, AdhocTD-Q and Multi-IQL. Panels: (a) TG of 10,000 episodes with b=+∞; (b) TG between 50 and 200 with b=+∞; (c) TG of 10,000 episodes with b=2,500; (d) TG between 50 and 200 with b=2,500.]