A Maximum Mutual Information Framework for Multi-Agent Reinforcement Learning
Woojun Kim, Whiyoung Jung, Myungsik Cho, Youngchul Sung
School of Electrical Engineering, KAIST, Korea
{woojun.kim, wy.jung, ms.cho, ycsung}@kaist.ac.kr
Abstract
In this paper, we propose a maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) to enable multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between actions. By introducing a latent variable to induce nonzero mutual information between actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution (CTDE). We evaluated VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms MADDPG and other MARL algorithms in multi-agent tasks requiring coordination.
Introduction

With the success of RL in the single-agent domain [18, 13], MARL is being actively studied and applied to real-world problems such as traffic control systems and connected self-driving cars, which can be modeled as multi-agent systems requiring coordinated control [12, 1]. The simplest approach to MARL is independent learning, which trains each agent independently while treating the other agents as a part of the environment. One such example is independent Q-learning (IQL) [24], an extension of Q-learning to the multi-agent setting. However, this approach suffers from the non-stationarity of the environment. A common solution to this problem is to use a fully centralized critic in the framework of centralized training with decentralized execution (CTDE) [19, 21]. For example, MADDPG [16] uses a centralized critic to train a decentralized policy for each agent, and COMA [4] uses a common centralized critic to train all decentralized policies. However, these approaches assume that the decentralized policies are independent and hence that the joint policy is the product of each agent's policy. Such non-correlated factorization of the joint policy limits the agents' ability to learn coordinated behavior because it neglects the influence of the other agents [25, 2]. Learning coordinated behavior, however, is one of the fundamental problems in MARL [25, 15].

In this paper, we introduce a new framework for MARL to learn coordinated behavior under CTDE without the explicit dependency or communication in the execution phase used in previous work. Our framework is based on regularizing the expected cumulative reward with the mutual information among agents' actions, which is induced by injecting a latent variable. The intuition behind the proposed framework is that agents can coordinate with other agents if they know what the other agents will do with high probability, and the dependence between action policies can be captured by mutual information. High mutual information among actions means low uncertainty about the other agents' actions. Hence, by regularizing the expected cumulative reward with the mutual information among agents' actions, we can coordinate the behaviors of agents implicitly without explicit dependence enforcement. However, the optimization problem with the proposed objective function has several difficulties since we consider
decentralized policies without explicit dependence or communication in the execution phase. In addition, optimizing mutual information is difficult because of the intractable conditional distribution. We circumvent these difficulties by exploiting the property of the latent variable injected to induce mutual information, and by applying a variational lower bound on the mutual information. With the proposed framework, we apply policy iteration with redefined value functions to obtain the VM3-AC algorithm for MARL with coordinated behavior under CTDE. Due to space limitations, related work is provided in Appendix A.

Background

We consider a Markov game [14], which is an extension of the Markov decision process (MDP) to the multi-agent setting. An $N$-agent Markov game is defined by an environment state space $\mathcal{S}$, action spaces $\mathcal{A}^1, \cdots, \mathcal{A}^N$ for the $N$ agents, a state transition probability $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, where $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}^i$ is the joint action space, and a reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. At each time step $t$, agent $i$ executes action $a_t^i \in \mathcal{A}^i$ based on state $s_t \in \mathcal{S}$. The joint action of all agents $\mathbf{a}_t = (a_t^1, \cdots, a_t^N)$ yields the next state $s_{t+1}$ according to $\mathcal{T}$ and a shared common reward $r_t$ according to $R$ under the assumption of fully cooperative MARL. The discounted return is defined as $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau} r_{\tau}$, where $\gamma \in [0,1)$ is the discount factor.

We assume CTDE, which incorporates the resource asymmetry between the training and execution phases and is widely considered in MARL [16, 8, 4]. Under CTDE, each agent can access all information, including the environment state and the observations and actions of the other agents, in the training phase, whereas the policy of each agent can be conditioned only on its own action-observation history $\tau_t^i$ or observation $o_t^i$ in the execution phase. For a given joint policy $\boldsymbol{\pi} = (\pi^1, \cdots, \pi^N)$, the goal of fully cooperative MARL is to find the optimal joint policy $\boldsymbol{\pi}^*$ that maximizes the objective $J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}[R_0]$.
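To make the setting concrete, the following minimal sketch (our illustration, not code from the paper) shows one possible interface for a fully cooperative N-agent Markov game together with a decentralized-execution rollout in which each policy sees only its own observation; the class and method names are hypothetical placeholders.

```python
from typing import Callable, List, Tuple
import numpy as np

class CooperativeMarkovGame:
    """Hypothetical interface of an N-agent fully cooperative Markov game.

    All agents receive the same scalar reward r_t, and the transition depends
    on the joint action a_t = (a_t^1, ..., a_t^N)."""

    def __init__(self, n_agents: int, obs_dim: int):
        self.n_agents, self.obs_dim = n_agents, obs_dim

    def reset(self) -> List[np.ndarray]:
        # One local observation o^i per agent.
        return [np.zeros(self.obs_dim) for _ in range(self.n_agents)]

    def step(self, joint_action: List[np.ndarray]) -> Tuple[List[np.ndarray], float, bool]:
        # Placeholder dynamics: next observations, shared reward, done flag.
        next_obs = [np.zeros(self.obs_dim) for _ in range(self.n_agents)]
        return next_obs, 0.0, False

def decentralized_rollout(env: CooperativeMarkovGame,
                          policies: List[Callable[[np.ndarray], np.ndarray]],
                          horizon: int = 100) -> float:
    """Decentralized execution: agent i acts from its own observation o^i only."""
    obs, ret, done, t = env.reset(), 0.0, False, 0
    while not done and t < horizon:
        joint_action = [policies[i](obs[i]) for i in range(env.n_agents)]
        obs, reward, done = env.step(joint_action)
        ret += reward  # shared reward accumulated into the common return
        t += 1
    return ret
```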
Maximum Entropy RL

The goal of maximum entropy RL is to find an optimal policy that maximizes the entropy-regularized objective function, given by
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \big( r_t(s_t, a_t) + \alpha H(\pi(\cdot|s_t)) \big)\right]. \qquad (1)$$
It is known that this objective encourages the policy to explore widely in the state and action spaces and helps the policy avoid converging to a poor local optimum. Soft actor-critic (SAC), which is based on the maximum entropy RL principle, approximates soft policy iteration within an actor-critic method. SAC outperforms other deep RL algorithms on many continuous-action tasks [7].

We can simply extend SAC to the multi-agent setting in the manner of independent learning: each agent trains a decentralized policy using a decentralized critic to maximize the weighted sum of the cumulative return and the entropy of its own policy. We refer to this method as independent SAC (I-SAC). Adopting the framework of CTDE, we can replace the decentralized critic with a centralized critic that incorporates the observations and actions of all agents. We refer to this method as multi-agent soft actor-critic (MA-SAC). Both I-SAC and MA-SAC are considered as baselines in the experiment section.

Proposed MMI Framework

We assume that the environment is fully observable, i.e., each agent can observe the environment state $s_t$, for the theoretical development in this section, and we will consider the partially observable environment for the practical algorithm construction under CTDE in the next section.

Under the proposed MMI framework, we aim to find the policy that maximizes the mutual information between actions in addition to the cumulative return. Thus, the MMI-regularized objective function for joint policy $\boldsymbol{\pi}$ is given by
$$J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r_t(s_t, \mathbf{a}_t) + \alpha \sum_{(i,j)} I\big(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t)\big) \Big)\right], \qquad (2)$$
where $a_t^i \sim \pi^i(\cdot|s_t)$ and $\alpha$ is the temperature parameter that controls the relative importance of the mutual information against the reward.

As aforementioned, we assume decentralized policies and want the decentralized policies to exhibit coordinated behavior. Furthermore, we want this coordinated behavior without the explicit dependency previously used to enforce it. Here, explicit dependency [9] means that for two agents $i$ and $j$, the action $a_t^i$ of agent $i$ follows $a_t^i \sim \pi^i(a_t^i|s_t)$ and then the action $a_t^j$ of agent $j$ follows $a_t^j \sim \pi^j(a_t^j|s_t, a_t^i)$, i.e., the input to the policy function of agent $j$ explicitly requires information about the action of agent $i$. With the mutual information regularization in the proposed objective function (2), the policy of each agent is implicitly encouraged to coordinate with the other agents' policies, without explicit dependency, by reducing the uncertainty about the other agents' policies. This can be seen as follows. Mutual information is expressed in terms of the entropy and the conditional entropy as
$$I\big(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t)\big) = H\big(\pi^j(\cdot|s_t)\big) - H\big(\pi^j(\cdot|s_t) \,\big|\, \pi^i(\cdot|s_t)\big). \qquad (3)$$
If the knowledge of $\pi^i(\cdot|s_t)$ does not provide any information about $\pi^j(\cdot|s_t)$, the conditional entropy reduces to the unconditional entropy, i.e., $H(\pi^j(\cdot|s_t) \,|\, \pi^i(\cdot|s_t)) = H(\pi^j(\cdot|s_t))$, and the mutual information becomes zero. Maximizing the mutual information is thus equivalent to minimizing the uncertainty about the other agents' policies conditioned on the agent's own policy, which can lead the agent to learn coordinated behavior based on this reduced uncertainty.
[Figure 1: Causal diagrams for a 2-agent Markov game: (a) standard MARL; (b) standard MARL with the latent variable $z_t$ injected into both policies.]

However, direct optimization of the objective function (2) is not easy. Fig. 1(a) shows the causal diagram of the system model described in Section 2 in the case of two agents with decentralized policies. Since we consider the case of no explicit dependency, the two policy distributions can be expressed as $\pi^1(a_t^1|s_t)$ and $\pi^2(a_t^2|s_t)$. Then, for a given environment state $s_t$ observed by both agents, $\pi^1(a_t^1|s_t)$ and $\pi^2(a_t^2|s_t)$ are conditionally independent and the mutual information $I(\pi^1(\cdot|s_t); \pi^2(\cdot|s_t)) = 0$. Thus, the MMI objective (2) reduces to the standard MARL objective of only the accumulated return. In the following subsections, we present our approach to circumvent this difficulty and implement the MMI framework and its operation under CTDE.

First, in order to induce mutual information among agents' policies under the system causal diagram shown in Fig. 1(a), we introduce a latent variable $z_t$. For illustration, consider the new diagram with latent variable $z_t$ in Fig. 1(b). Suppose that the latent variable $z_t$ has a prior distribution $p(z_t)$, and assume that both actions $a_t^1$ and $a_t^2$ are generated from the observed random variable $s_t$ and the unobserved random variable $z_t$. Then, the policy of agent $i$ is given by the marginal distribution $\pi^i(\cdot|s_t) = \int_z \pi^i(\cdot|s_t, z)\, p(z)\, dz$ marginalized over $z$. With the unobserved latent random variable $z$, conditional independence no longer holds for $a_t^1$ and $a_t^2$, and the mutual information can be positive, i.e., $I(\pi^1(\cdot|s_t); \pi^2(\cdot|s_t)) > 0$. Hence, we can induce mutual information between actions without explicit dependence by introducing the latent variable. In the general case of $N$ agents, we have $\pi(a^1, \cdots, a^N | s) = \mathbb{E}_z\big[\pi^1(a^1|s, z) \cdots \pi^N(a^N|s, z)\big]$. Note that in this case we inject a common latent variable $z$ into all agents' policies; a toy numerical illustration is given below.
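The following toy calculation (our own illustration, not part of the paper) makes this concrete for two discrete-action agents at a fixed state: when the two policies are conditionally independent given the state, the mutual information between their actions is zero, whereas marginalizing the same latent-conditioned policies over a shared latent variable z yields a strictly positive mutual information. The distributions below are arbitrary made-up numbers.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) for a discrete joint distribution given as a 2-D array."""
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Two latent-conditioned policies pi^i(a | s, z) for a fixed state s and a
# binary latent z with prior p(z) = [0.5, 0.5]. Rows index z, columns index actions.
pz = np.array([0.5, 0.5])
pi1 = np.array([[0.9, 0.1],    # pi^1(a^1 | s, z=0)
                [0.1, 0.9]])   # pi^1(a^1 | s, z=1)
pi2 = np.array([[0.8, 0.2],
                [0.2, 0.8]])

# (a) No latent variable: actions are conditionally independent given s,
#     so the joint factorizes and the mutual information is zero.
m1, m2 = pz @ pi1, pz @ pi2                   # marginal policies pi^i(a | s)
joint_independent = np.outer(m1, m2)
print(mutual_information(joint_independent))  # ~0.0

# (b) Common latent variable: the joint is E_z[pi^1(.|s,z) pi^2(.|s,z)],
#     which no longer factorizes, so the mutual information is positive.
joint_latent = sum(pz[z] * np.outer(pi1[z], pi2[z]) for z in range(2))
print(mutual_information(joint_latent))       # > 0
```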
Even with non-trivial mutual information $I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t))$, it is difficult to compute the mutual information directly. Note that we need the conditional distribution of $a_t^j$ given $(a_t^i, s_t)$ to compute the mutual information, as seen in (4), but this conditional distribution is difficult to obtain directly. To circumvent this difficulty, we use a variational distribution $q(a_t^j|a_t^i, s_t)$ to approximate $p(a_t^j|a_t^i, s_t)$ and bound $I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t)) =: I_{ij}(s_t)$ as
$$I_{ij}(s_t) = \mathbb{E}_{p(a_t^i, a_t^j|s_t)}\left[\log \frac{q(a_t^j|a_t^i, s_t)}{p(a_t^j)}\right] + \mathbb{E}_{p(a_t^i|s_t)}\Big[\mathrm{KL}\big(p(a_t^j|a_t^i, s_t) \,\|\, q(a_t^j|a_t^i, s_t)\big)\Big] \geq H\big(\pi^j(\cdot|s_t)\big) + \mathbb{E}_{p(a_t^i, a_t^j|s_t)}\big[\log q(a_t^j|a_t^i, s_t)\big], \qquad (4)$$
where the inequality holds because the KL divergence is always non-negative. The lower bound becomes tight when $q(a_t^j|a_t^i, s_t)$ approximates $p(a_t^j|a_t^i, s_t)$ well. Using the symmetry of mutual information, we can rewrite the lower bound as
$$I_{ij}(s_t) \geq \frac{1}{2}\Big[ H\big(\pi^i(\cdot|s_t)\big) + H\big(\pi^j(\cdot|s_t)\big) + \mathbb{E}_{p(a^i, a^j|s_t)}\big[\log q(a^i|a^j, s_t) + \log q(a^j|a^i, s_t)\big] \Big]. \qquad (5)$$
Then, we can maximize this lower bound on the mutual information by using the tractable approximation $q(a_t^i|a_t^j, s_t)$.

In this subsection, we develop policy iteration for the MMI framework. First, we replace the original MMI objective function (2) with the following tractable objective function based on the variational lower bound (5):
$$\hat{J}(\boldsymbol{\pi}, q) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r_t(s_t, \mathbf{a}_t) + \frac{\alpha}{N} \sum_{i=1}^{N} H\big(\pi^i(\cdot|s_t)\big) + \frac{\alpha}{N^2} \sum_{i=1}^{N} \sum_{j \neq i} \log q(a_t^j|a_t^i, s_t) \Big)\right], \qquad (6)$$
where $q(a_t^j|a_t^i, s_t)$ is the variational distribution that approximates the conditional distribution $p(a_t^j|a_t^i, s_t)$. Then, we determine the individual objective function $\hat{J}^i(\pi^i, q)$ for agent $i$ as the sum of the terms in (6) associated with agent $i$'s policy $\pi^i$ or action $a_t^i$, given by
$$\hat{J}^i(\pi^i, q) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r_t(s_t, \mathbf{a}_t) + \underbrace{\beta \, H\big(\pi^i(\cdot|s_t)\big)}_{(a)} + \frac{\beta}{N} \sum_{j \neq i} \underbrace{\big[\log q(a_t^i|a_t^j, s_t) + \log q(a_t^j|a_t^i, s_t)\big]}_{(b)} \Big)\right], \qquad (7)$$
where $\beta = \alpha/N$ is the temperature parameter. Note that maximizing the term (a) in (7) implies that each agent maximizes the weighted sum of the policy entropy and the return, which can be interpreted as an extension of maximum entropy RL to the multi-agent setting. On the other hand, maximizing the term (b) with respect to $\pi^i$ means that we update the policy $\pi^i$ so that agent $j$ predicts agent $i$'s action well (the first term in (b)) and agent $i$ predicts agent $j$'s action well (the second term in (b)). Thus, the objective function (7) can be interpreted as the maximum entropy MARL objective combined with predictability enhancement for other agents' actions. Note that predictability is reduced when actions are uncorrelated. Since the policy entropy term $H(\pi^i(\cdot|s_t))$ enhances individual exploration due to the maximum entropy principle [7] and the term (b) in (7) enhances predictability, i.e., correlation among agents' actions, the proposed objective function (7) can be considered as one implementation of the concept of correlated exploration in MARL [17].

Now, in order to learn policy $\pi^i$ to maximize the objective function (7), we modify the policy iteration of standard RL. For this, we redefine the state and state-action value functions for each agent as follows:
$$V_i^{\boldsymbol{\pi}}(s) \triangleq \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r_t + \beta H\big(\pi^i(\cdot|s_t)\big) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t) \Big) \,\Bigg|\, s_0 = s\right] \qquad (8)$$
$$Q_i^{\boldsymbol{\pi}}(s, \mathbf{a}) \triangleq \mathbb{E}_{\boldsymbol{\pi}}\left[ r_0 + \gamma V_i^{\boldsymbol{\pi}}(s_1) \,\Big|\, s_0 = s, \mathbf{a}_0 = \mathbf{a} \right], \qquad (9)$$
where $q^{(i,j)}(a_t^i, a_t^j, s_t) \triangleq q(a_t^i|a_t^j, s_t)\, q(a_t^j|a_t^i, s_t)$. Then, the Bellman operator corresponding to $V_i^{\boldsymbol{\pi}}$ and $Q_i^{\boldsymbol{\pi}}$ is given by
$$\mathcal{T}^{\boldsymbol{\pi}} Q_i(s, \mathbf{a}) \triangleq r(s, \mathbf{a}) + \gamma \mathbb{E}_{s' \sim p}\big[V_i(s')\big], \qquad (10)$$
where
$$V_i(s) = \mathbb{E}_{\mathbf{a} \sim \boldsymbol{\pi}}\left[ Q_i(s, \mathbf{a}) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s) \right]. \qquad (11)$$
In the policy evaluation step, we compute the value functions defined in (19) and (20) by applying the modified Bellman operator $\mathcal{T}^{\boldsymbol{\pi}}$ repeatedly to an arbitrary initial function $Q_i$.

Lemma 1. (Variational Policy Evaluation). For fixed $\boldsymbol{\pi}$ and variational distribution $q$, consider the modified Bellman operator $\mathcal{T}^{\boldsymbol{\pi}}$ in (21) and an arbitrary initial function $Q_i^0: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, and define $Q_i^{k+1} = \mathcal{T}^{\boldsymbol{\pi}} Q_i^k$. Then, $Q_i^k$ converges to $Q_i^{\boldsymbol{\pi}}$ defined in (20).
Proof. See Appendix B.

In the policy improvement step, we update the policy and the variational distribution by using the value function evaluated in the policy evaluation step. Here, each agent updates its policy and variational distribution while keeping the other agents' policies fixed, as follows:
$$(\pi^i_{k+1}, q_{k+1}) = \arg\max_{\pi^i, q}\ \mathbb{E}_{(a^i, a^{-i}) \sim (\pi^i, \pi_k^{-i})}\left[ Q_i^{\boldsymbol{\pi}_k}(s, \mathbf{a}) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s) \right], \qquad (12)$$
where $a^{-i} \triangleq \{a^1, \cdots, a^N\} \setminus \{a^i\}$. Then, we have the following lemma regarding the improvement step.

Lemma 2. (Variational Policy Improvement). Let $\pi^i_{\mathrm{new}}$ and $q_{\mathrm{new}}$ be the updated policy and variational distribution from (35). Then, $Q_i^{\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}}}(s, \mathbf{a}) \geq Q_i^{\pi^i_{\mathrm{old}}, \pi^{-i}_{\mathrm{old}}}(s, \mathbf{a})$ for all $(s, \mathbf{a}) \in \mathcal{S} \times \mathcal{A}$.
Proof. See Appendix B.

The modified policy iteration is defined as applying the variational policy evaluation and variational policy improvement steps in an alternating manner. Each agent trains its policy, critics, and variational distribution to maximize its objective function (7).

VM3-AC Algorithm

Summarizing the development above, we now propose the variational maximum mutual information multi-agent actor-critic (VM3-AC) algorithm, which can be applied to continuous and partially observable multi-agent environments under CTDE. The overall operation of VM3-AC is shown in Fig. 2.

[Figure 2: Overall operation of the proposed VM3-AC. Only the operation in the red box is needed after training.]

Under CTDE, each agent's policy is conditioned only on its local observation, and the centralized critics are conditioned on either the environment state or the observations of all agents, depending on the situation [16]. Let $\mathbf{x}$ denote either the environment state $s$ or the observations of all agents $(o^1, \cdots, o^N)$, whichever is used. In order to deal with large continuous state-action spaces, we adopt deep neural networks to approximate the required functions. For agent $i$, we parameterize the variational distribution with $\xi^i$ as $q_{\xi^i}(a^j|a^i, o^i, o^j)$, the state-value function with $\psi^i$ as $V^i_{\psi^i}(\mathbf{x})$, two action-value functions with $\theta^{i,1}$ and $\theta^{i,2}$ as $Q^i_{\theta^{i,1}}(\mathbf{x}, \mathbf{a})$ and $Q^i_{\theta^{i,2}}(\mathbf{x}, \mathbf{a})$, and the policy with $\phi^i$ as $\pi^i_{\phi^i}(a|o^i) = \mathbb{E}_z[\pi^i_{\phi^i}(a|o^i, z)]$. We assume a normal distribution for the latent variable, which plays a key role in inducing coordination among agents' policies, i.e., $z_t \sim \mathcal{N}(0, I)$, and we further assume that the variational distribution is a Gaussian distribution with constant variance $\sigma^2$, i.e., $q_{\xi^i}(a^j|a^i, o^i, o^j) = \mathcal{N}(\mu_{\xi^i}(a^i, o^i, o^j), \sigma^2)$, where $\mu_{\xi^i}(a^i, o^i, o^j)$ is the mean of the distribution. As aforementioned, the policy is the distribution marginalized over the latent variable $z$, where the policies of all agents take the same $z_t$ generated from $\mathcal{N}(0, I)$ as an input; a network-level sketch is given below.
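A minimal PyTorch sketch of this parameterization is given below; it is our illustration of the described architecture, not the authors' released code, and the layer sizes and names are arbitrary. The policy is a Gaussian conditioned on the local observation and the shared latent z, and the variational predictor outputs only the mean of a fixed-variance Gaussian over the other agent's action.

```python
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    """pi^i_phi(a | o^i, z): Gaussian policy on the local observation and the shared latent z."""
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.z_dim = z_dim
        self.body = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor, z: torch.Tensor):
        h = self.body(torch.cat([obs, z], dim=-1))
        mu, log_std = self.mu_head(h), self.log_std_head(h).clamp(-5, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        action = dist.rsample()                       # reparameterized sample
        log_prob = dist.log_prob(action).sum(-1)      # log pi^i(a | o^i, z)
        return action, log_prob

class VariationalPredictor(nn.Module):
    """q_xi(a^j | a^i, o^i, o^j): Gaussian with constant variance sigma^2; the net outputs the mean."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128, sigma: float = 0.5):
        super().__init__()
        self.sigma = sigma
        self.net = nn.Sequential(nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))

    def log_prob(self, a_j, a_i, o_i, o_j):
        mu = self.net(torch.cat([a_i, o_i, o_j], dim=-1))
        return torch.distributions.Normal(mu, self.sigma).log_prob(a_j).sum(-1)

# All agents feed the SAME latent sample into their policies,
# which is what couples their action distributions.
obs_dim, act_dim, z_dim, n_agents = 8, 2, 4, 3
policies = [LatentConditionedPolicy(obs_dim, act_dim, z_dim) for _ in range(n_agents)]
z = torch.randn(1, z_dim)                             # shared z_t ~ N(0, I)
observations = [torch.randn(1, obs_dim) for _ in range(n_agents)]
joint_action = [policies[i](observations[i], z)[0] for i in range(n_agents)]
```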
We perform the required marginalization via a Monte Carlo estimate of the expectation as follows:
$$\pi(\mathbf{a}|s) = \mathbb{E}_z\big[\pi^1_{\phi^1}(a^1|s, z) \cdots \pi^N_{\phi^N}(a^N|s, z)\big] \simeq \frac{1}{L} \sum_{l=1}^{L} \pi^1_{\phi^1}(a^1|s, z_l) \cdots \pi^N_{\phi^N}(a^N|s, z_l), \qquad (13)$$
and we use $L = 1$ for simplicity. The value functions $V^i_{\psi^i}(\mathbf{x})$ and $Q^i_{\theta^i}(\mathbf{x}, \mathbf{a})$ are updated based on the modified Bellman operator defined in (21) and (22). The state-value function $V^i_{\psi^i}(\mathbf{x})$ is trained to minimize the following loss function:
$$L_V(\psi^i) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\big( V^i_{\psi^i}(\mathbf{x}_t) - \hat{V}^i(\mathbf{x}_t) \big)^2 \right], \qquad (14)$$
where
$$\hat{V}^i(\mathbf{x}_t) = \mathbb{E}_{z \sim \mathcal{N}(0,I),\, \{a^i \sim \pi^i(\cdot|o_t^i, z)\}_{i=1}^N}\left[ Q^i_{\min}(\mathbf{x}_t, \mathbf{a}_t) - \beta \log \pi^i_{\phi^i}(a_t^i|o_t^i) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\xi^i}(a_t^i, a_t^j, o_t^i, o_t^j) \right],$$
$D$ is the replay buffer that stores the transitions $(\mathbf{x}_t, \mathbf{a}_t, r_t, \mathbf{x}_{t+1})$, and $Q^i_{\min}(\mathbf{x}_t, \mathbf{a}_t) = \min\big[ Q^i_{\theta^{i,1}}(\mathbf{x}_t, \mathbf{a}_t), Q^i_{\theta^{i,2}}(\mathbf{x}_t, \mathbf{a}_t) \big]$ is the minimum of the two action-value functions, used to prevent the overestimation problem [5]. The two action-value functions are updated by minimizing the loss
$$L_Q(\theta^i) = \mathbb{E}_{(\mathbf{x}_t, \mathbf{a}_t) \sim D}\left[ \tfrac{1}{2}\big( Q_{\theta^i}(\mathbf{x}_t, \mathbf{a}_t) - \hat{Q}(\mathbf{x}_t, \mathbf{a}_t) \big)^2 \right], \qquad (15)$$
where
$$\hat{Q}(\mathbf{x}_t, \mathbf{a}_t) = r_t(\mathbf{x}_t, \mathbf{a}_t) + \gamma \mathbb{E}_{\mathbf{x}_{t+1}}\big[ V_{\bar{\psi}^i}(\mathbf{x}_{t+1}) \big] \qquad (16)$$
and $V_{\bar{\psi}^i}$ is the target value network, which is updated by the exponential moving average method. We implement the reparameterization trick to estimate the stochastic gradient of the policy loss. The action of agent $i$ is then given by $a^i = f_{\phi^i}(s; \epsilon^i, z)$, where $\epsilon^i \sim \mathcal{N}(0, I)$ and $z \sim \mathcal{N}(0, I)$. The policy of agent $i$ and the variational distribution are trained to minimize the following policy improvement loss:
$$L_{\pi^i, q}(\phi^i, \xi) = \mathbb{E}_{s_t \sim D,\, \epsilon^i \sim \mathcal{N},\, z \sim \mathcal{N}}\left[ -Q^i_{\theta^{i,1}}(\mathbf{x}_t, \mathbf{a}) + \beta \log \pi^i_{\phi^i}(a^i|o_t^i) - \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\xi^i}(a^i, a^j, o_t^i, o_t^j) \right], \qquad (17)$$
where
$$q^{(i,j)}_{\xi^i}(a_t^i, a_t^j, o_t^i, o_t^j) = \underbrace{q_{\xi^i}(a_t^i|a_t^j, o_t^i, o_t^j)}_{(a)}\ \underbrace{q_{\xi^i}(a_t^j|a_t^i, o_t^i, o_t^j)}_{(b)}. \qquad (18)$$
Since the approximation of the variational distribution is not accurate in the early stage of training and learning via the term (a) in (18) is more susceptible to this approximation error, we propagate the gradient only through the term (b) in (18) to make learning stable. Note that minimizing $-\log q_{\xi^i}(a^j|a^i, s_t)$ is equivalent to minimizing the mean-squared error between $a^j$ and $\mu_{\xi^i}(a^i, o^i, o^j)$ due to our Gaussian assumption on the variational distribution.
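A condensed sketch of how the updates (14)-(17) could be assembled for a single agent is given below. This is our schematic illustration under several assumptions, not the authors' implementation: it assumes policy and variational-predictor modules with the call signatures of the earlier sketch, centralized critics Q(x, a) and V(x) given as callables, a replay batch of tensors, and a single latent sample (L = 1); the detail of propagating the gradient only through the term (b) of (18) is omitted for brevity.

```python
import torch

def vm3ac_losses(batch, agent_id, policies, q1, q2, v, v_target, q_preds,
                 beta, gamma, n_agents):
    """Schematic per-agent losses for equations (14)-(17) (illustrative only).

    Assumed interfaces:
      policies[k](obs, z)                          -> (action, log_prob)
      q1(x, joint_action), q2(x, joint_action)     -> Q-value tensors
      v(x), v_target(x)                            -> value tensors
      q_preds[(i, j)].log_prob(a_j, a_i, o_i, o_j) -> log q_xi(a^j | a^i, o^i, o^j)
      batch: dict with 'x', 'x_next', 'rew', 'done', 'act' (joint), 'obs' (list per agent)
    """
    i = agent_id
    z = torch.randn(batch['rew'].shape[0], policies[i].z_dim)   # shared latent, L = 1

    # Critic target (16): r + gamma * V_target(x') on non-terminal transitions.
    with torch.no_grad():
        q_hat = batch['rew'] + gamma * (1.0 - batch['done']) * v_target(batch['x_next']).squeeze(-1)
    loss_q = 0.5 * (((q1(batch['x'], batch['act']).squeeze(-1) - q_hat) ** 2)
                    + ((q2(batch['x'], batch['act']).squeeze(-1) - q_hat) ** 2)).mean()

    # Fresh joint action from the current policies with a common z (used by (14) and (17)).
    acts, logps = zip(*[policies[k](batch['obs'][k], z) for k in range(n_agents)])
    acts = list(acts)
    mi_term = sum(q_preds[(i, j)].log_prob(acts[j], acts[i], batch['obs'][i], batch['obs'][j])
                  for j in range(n_agents) if j != i)

    # Value target (14): min of the two critics minus entropy term plus variational term.
    q_min = torch.min(q1(batch['x'], acts), q2(batch['x'], acts)).squeeze(-1)
    v_hat = (q_min - beta * logps[i] + (beta / n_agents) * mi_term).detach()
    loss_v = 0.5 * ((v(batch['x']).squeeze(-1) - v_hat) ** 2).mean()

    # Policy / variational loss (17): minimized with respect to phi^i and xi^i.
    loss_pi = (-q1(batch['x'], acts).squeeze(-1) + beta * logps[i]
               - (beta / n_agents) * mi_term).mean()
    return loss_q, loss_v, loss_pi
```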
In the centralized training phase, we pick the actions $(a^1, \cdots, a^N)$ by using the Monte Carlo expectation based on a common latent variable $z_l$ generated from the zero-mean Gaussian distribution, as seen in (13). We can achieve the same operation in the decentralized execution phase. This can be done by making all agents have the same Gaussian random sequence generator and distributing the same seed to this generator only once, at the beginning of the execution phase. This eliminates the need for communication to share the latent variable. In fact, this way of sharing $z_l$ can be applied to the centralized training phase too. The proposed VM3-AC algorithm is summarized in Appendix C.
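The seed-sharing mechanism can be illustrated in a few lines (our sketch; the generator API is standard PyTorch): if every agent owns a local generator initialized with the same seed and draws exactly one latent per time step, all agents obtain identical z_t at every step without exchanging any messages.

```python
import torch

class LocalLatentSampler:
    """Per-agent latent-variable source for decentralized execution.

    Every agent constructs its own sampler with the SAME seed (distributed once,
    before execution starts), so all agents draw identical z_t at every step
    without communicating."""

    def __init__(self, z_dim: int, seed: int):
        self.z_dim = z_dim
        self.gen = torch.Generator().manual_seed(seed)

    def next_latent(self) -> torch.Tensor:
        return torch.randn(self.z_dim, generator=self.gen)   # z_t ~ N(0, I)

# Each of the N agents holds its own sampler, but the shared seed keeps them in sync.
samplers = [LocalLatentSampler(z_dim=4, seed=1234) for _ in range(3)]
z_per_agent = [s.next_latent() for s in samplers]
assert all(torch.equal(z_per_agent[0], z) for z in z_per_agent[1:])
```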
Experiments

In this section, we provide numerical results to evaluate VM3-AC. Since we focus on the continuous action-space case in this paper, we considered four baselines relevant to this case: 1) MADDPG [16], an extension of DDPG with a centralized critic to train a decentralized policy for each agent; 2) I-SAC, an example of independent learning in which each agent learns its policy based on SAC while treating the other agents as a part of the environment; 3) MA-SAC, an extension of I-SAC with a centralized critic instead of a decentralized critic; and 4) multi-agent actor-critic (MA-AC), a variant of MA-SAC, i.e., the same algorithm as MA-SAC but without the entropy term. All algorithms used neural networks to approximate the required functions. In all algorithms except I-SAC, we used the neural network architecture proposed in [10] for the centralized critics to emphasize the agent's own observation and action. For agent $i$, we used a single shared neural network for the variational distributions $q_{\xi^i}(a_t^j|a_t^i, o_t^i, o_t^j)$ for $j \in \{1, \cdots, N\} \setminus \{i\}$; the network takes a one-hot vector indicating $j$ as an additional input. Experimental details are given in Appendix E.

We evaluated the proposed algorithm and the baselines in three multi-agent environments with varying numbers of agents: multi-walker [6], predator-prey [16], and cooperative navigation [16]. The detailed setting of each environment is provided in Appendix D.

[Figure 3: Performance of MADDPG (blue), MA-AC (green), I-SAC (purple), MA-SAC (black), and VM3-AC (the proposed method, red) on multi-walker ((a) N=3, (b) N=4), predator-prey ((c) N=2, (d) N=3, (e) N=4), and cooperative navigation ((f) N=3). MW, PP, and CN denote the multi-walker, predator-prey, and cooperative navigation environments, respectively.]

Fig. 3 shows the learning curves for the three environments with different numbers of agents. The y-axis denotes the average of all agents' rewards averaged over 7 random seeds, and the x-axis denotes the time step. The hyperparameters, including the temperature parameter $\beta$ and the dimension of the latent variable, are provided in Appendix E.

As shown in Fig. 3, VM3-AC outperforms the baselines in the considered environments. In particular, in the multi-walker environment, the proposed VM3-AC algorithm has a large performance gain. This is because the agents in the multi-walker environment especially need to learn coordinated behavior to obtain high rewards. Hence, we can see that the proposed MMI framework improves performance in complex multi-agent tasks requiring high-quality coordination. The performance gap between VM3-AC and MA-SAC indicates the effect of regularization with the variational term (b) of the objective function (7). Recall that VM3-AC without the variational term (b) of the objective function (7) reduces to MA-SAC. Recall also that MA-SAC without entropy regularization reduces to MA-AC, and MA-SAC with decentralized critics instead of centralized critics reduces to I-SAC. Since MA-SAC outperforms I-SAC and MA-AC, regularization with entropy and the use of centralized critics are also important in multi-agent tasks. Note that VM3-AC also maximizes the entropy through the term (a) of the objective function (7). Indeed, it is seen that regularization with the variational term in addition to the policy entropy enhances coordinated behavior in MARL.

[Figure 4: (a) & (b) impact of the latent variable (multi-walker, N=3 and N=4); (c) & (d) impact of the temperature parameter $\beta$ (multi-walker, N=3 and N=4).]

Due to the space limitation, a further comparison with the recent algorithm MAVEN [17] is provided in Appendix F. It is seen there that VM3-AC significantly outperforms MAVEN.
Ablation Study

In this section, we provide an ablation study on the major techniques and hyperparameters of VM3-AC: 1) the latent variable, and 2) the temperature parameter $\beta$.

Latent variable:
The role of the latent variable is to induce mutual information among actions and promote coordinated behavior. We compared VM3-AC with VM3-AC without the latent variable (implemented by setting $\dim(z) = 0$) in the multi-walker environment with N = 3 and N = 4. In both cases, VM3-AC yields better performance than VM3-AC without the latent variable, as shown in Fig. 4(a) and 4(b).

Temperature parameter $\beta$: The role of the temperature parameter $\beta$ is to control the relative importance between the reward and the mutual information. We evaluated VM3-AC by varying $\beta$ (including $\beta = 0$) in the multi-walker environment with N = 3 and N = 4. Fig. 4(c) and 4(d) show that VM3-AC with a moderate nonzero temperature value yields good performance.

Conclusion

In this paper, we have proposed the MMI framework for MARL to enhance multi-agent coordinated learning under CTDE by regularizing the cumulative return with the mutual information among actions. The MMI framework is implemented practically by using a latent variable, a variational technique, and policy iteration. Numerical results show that the derived algorithm, named VM3-AC, outperforms other baselines, especially in multi-agent tasks requiring high coordination among agents. Furthermore, the MMI framework can be combined with other techniques for cooperative MARL, such as value decomposition [21], to yield better performance.

Broader Impact
The research topic of this paper is multi-agent reinforcement learning (MARL). MARL is an important branch of the field of reinforcement learning. MARL models many practical control problems in the real world, such as smart factories, coordinated robots, and connected self-driving cars. With the advance of knowledge and technology in MARL, solutions to such real-world problems can become better and more robust. For example, if the control of self-driving cars is coordinated among several nearby cars, the safety of self-driving will be much improved. Hence, we believe that research advances in this field can benefit our safety and future society.
References

[1] Andriotis, C. and Papakonstantinou, K. Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliability Engineering & System Safety, 191:106483, 2019.
[2] de Witt, C. S., Foerster, J., Farquhar, G., Torr, P., Böhmer, W., and Whiteson, S. Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9924–9935, 2019.
[3] Foerster, J., Assael, I. A., De Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
[4] Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[5] Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
[6] Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Springer, 2017.
[7] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[8] Iqbal, S. and Sha, F. Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912, 2018.
[9] Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P. A., Strouse, D., Leibo, J. Z., and De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. arXiv preprint arXiv:1810.08647, 2018.
[10] Kim, W., Cho, M., and Sung, Y. Message-dropout: An efficient training method for multi-agent deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6079–6086, 2019.
[11] Langley, P. Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
[12] Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G., and Ye, J. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pp. 983–994, 2019.
[13] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[14] Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Elsevier, 1994.
[15] Liu, M., Zhou, M., Zhang, W., Zhuang, Y., Wang, J., Liu, W., and Yu, Y. Multi-agent interactions modeling with correlated policies. arXiv preprint arXiv:2001.03415, 2020.
[16] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
[17] Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622, 2019.
[18] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[19] OroojlooyJadid, A. and Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963, 2019.
[20] Pesce, E. and Montana, G. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. arXiv preprint arXiv:1901.03887, 2019.
[21] Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
[22] Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.
[23] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
[24] Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.
[25] Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207, 2019.
[26] Zhang, C. and Lesser, V. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1101–1108, 2013.
[27] Zheng, S. and Yue, Y. Structured exploration via hierarchical variational policy networks. 2018.

Appendix A: Related Work
For cooperative MARL, several approaches have been studied. One approach is value decomposition [23, 21, 22]. For example, QMIX [21] factorizes the joint action-value function into a combination of local action-value functions while imposing a monotonicity constraint. QMIX achieves state-of-the-art performance in complex discrete-action MARL tasks and has been widely used as a baseline in discrete-action environments. Since the focus of VM3-AC is on continuous-action environments, a direct comparison of VM3-AC with QMIX is not applicable. However, the basic concept of QMIX can also be applied to the MMI framework, and this remains as future work.

Learning coordinated behavior in multi-agent systems has been studied extensively in the MARL community. To promote coordination, some previous works used communication among agents [26, 3, 20]. For example, [3] proposed the DIAL algorithm to learn a communication protocol that enables the agents to coordinate their behaviors. [9] proposed the social influence intrinsic reward, which is related to the mutual information between actions, to achieve coordination. Although the social influence algorithm improves performance in challenging social dilemma environments, its limitation is that explicit dependency across actions is required and imposed in order to compute the intrinsic reward. As already mentioned, the MMI framework can be viewed as an indirect enhancement of correlated exploration. Correlated policies are considered in several other works too: [15] proposed explicit modeling of correlated policies for multi-agent imitation learning, and [25] proposed a probabilistic recursive reasoning framework. By introducing a latent variable and a variational lower bound on the mutual information, the proposed VM3-AC increases the correlation among policies without communication in the execution phase and without explicit dependency across agents' actions.

As mentioned in the main paper, the proposed MMI framework can be interpreted as enhancing correlated exploration by increasing the entropy of an agent's own policy while decreasing the uncertainty about other agents' actions. Some previous works also proposed other techniques to enhance correlated exploration [17, 27]. For example, MAVEN addressed the poor exploration of QMIX by maximizing the mutual information between a latent variable and the observed trajectories [17]. However, MAVEN does not consider the correlation among policies. We compare the proposed VM3-AC with MAVEN, and the comparison result is given in Appendix F.

Appendix B: Variational Policy Evaluation and Policy Improvement
In the main paper, we defined the state and state-action value functions for each agent as follows:
$$V_i^{\boldsymbol{\pi}}(s) \triangleq \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r_t + \beta H\big(\pi^i(\cdot|s_t)\big) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t) \Big) \,\Bigg|\, s_0 = s\right] \qquad (19)$$
$$Q_i^{\boldsymbol{\pi}}(s, \mathbf{a}) \triangleq \mathbb{E}_{\boldsymbol{\pi}}\left[ r_0 + \gamma V_i^{\boldsymbol{\pi}}(s_1) \,\Big|\, s_0 = s, \mathbf{a}_0 = \mathbf{a} \right] \qquad (20)$$

Lemma 3. (Variational Policy Evaluation). For fixed $\boldsymbol{\pi}$ and variational distribution $q$, consider the modified Bellman operator $\mathcal{T}^{\boldsymbol{\pi}}$ in (21) and an arbitrary initial function $Q_i^0: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, and define $Q_i^{k+1} = \mathcal{T}^{\boldsymbol{\pi}} Q_i^k$. Then, $Q_i^k$ converges to $Q_i^{\boldsymbol{\pi}}$ defined in (20).
$$\mathcal{T}^{\boldsymbol{\pi}} Q_i(s, \mathbf{a}) \triangleq r(s, \mathbf{a}) + \gamma \mathbb{E}_{s' \sim p}\big[V_i(s')\big], \qquad (21)$$
where
$$V_i(s) = \mathbb{E}_{\mathbf{a} \sim \boldsymbol{\pi}}\left[ Q_i(s, \mathbf{a}) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s) \right] \qquad (22)$$

Proof.
Define the mutual-information-augmented reward $r_{\boldsymbol{\pi}}(s_t, \mathbf{a}_t)$ by expanding the modified Bellman operator:
$$\mathcal{T}^{\boldsymbol{\pi}} Q_i(s_t, \mathbf{a}_t) = r(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, \mathbf{a}_{t+1} \sim \boldsymbol{\pi}}\left[ Q_i(s_{t+1}, \mathbf{a}_{t+1}) - \beta \log \pi^i(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_{t+1}^i, a_{t+1}^j, s_{t+1}) \right] \qquad (23)$$
$$= \underbrace{r(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, \mathbf{a}_{t+1} \sim \boldsymbol{\pi}}\left[ -\beta \log \pi^i(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_{t+1}^i, a_{t+1}^j, s_{t+1}) \right]}_{r_{\boldsymbol{\pi}}(s_t, \mathbf{a}_t)} \qquad (24)$$
$$\quad + \gamma \mathbb{E}_{s_{t+1} \sim p,\, \mathbf{a}_{t+1} \sim \boldsymbol{\pi}}\big[ Q_i(s_{t+1}, \mathbf{a}_{t+1}) \big] \qquad (25)$$
$$= r_{\boldsymbol{\pi}}(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, \mathbf{a}_{t+1} \sim \boldsymbol{\pi}}\big[ Q_i(s_{t+1}, \mathbf{a}_{t+1}) \big]. \qquad (26)$$
Then, we can apply the standard convergence result for policy evaluation. Define
$$\mathcal{T}_{\boldsymbol{\pi}}(v) = R_{\boldsymbol{\pi}} + \gamma P_{\boldsymbol{\pi}} v \qquad (27)$$
for $v = [Q(s, \mathbf{a})]_{s \in \mathcal{S},\, \mathbf{a} \in \mathcal{A}}$. Then, the operator $\mathcal{T}_{\boldsymbol{\pi}}$ is a $\gamma$-contraction:
$$\|\mathcal{T}_{\boldsymbol{\pi}}(v) - \mathcal{T}_{\boldsymbol{\pi}}(u)\|_\infty = \|(R_{\boldsymbol{\pi}} + \gamma P_{\boldsymbol{\pi}} v) - (R_{\boldsymbol{\pi}} + \gamma P_{\boldsymbol{\pi}} u)\|_\infty \qquad (28)$$
$$= \|\gamma P_{\boldsymbol{\pi}} (v - u)\|_\infty \qquad (29)$$
$$\leq \|\gamma P_{\boldsymbol{\pi}}\|_\infty \|v - u\|_\infty \qquad (30)$$
$$\leq \gamma \|u - v\|_\infty. \qquad (31)$$
Note that the operator $\mathcal{T}_{\boldsymbol{\pi}}$ has a unique fixed point by the contraction mapping theorem, and we define this fixed point as $Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})$. Since
$$\|Q_i^k(s, \mathbf{a}) - Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})\|_\infty \leq \gamma \|Q_i^{k-1}(s, \mathbf{a}) - Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})\|_\infty \leq \cdots \leq \gamma^k \|Q_i^0(s, \mathbf{a}) - Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})\|_\infty, \qquad (32)$$
we have
$$\lim_{k \rightarrow \infty} \|Q_i^k(s, \mathbf{a}) - Q_i^{\boldsymbol{\pi}}(s, \mathbf{a})\|_\infty = 0, \qquad (33)$$
and this implies
$$\lim_{k \rightarrow \infty} Q_i^k(s, \mathbf{a}) = Q_i^{\boldsymbol{\pi}}(s, \mathbf{a}), \quad \forall (s, \mathbf{a}) \in \mathcal{S} \times \mathcal{A}. \qquad (34)$$

Lemma 4. (Variational Policy Improvement). Let $\pi^i_{\mathrm{new}}$ and $q_{\mathrm{new}}$ be the updated policy and variational distribution from (35). Then, $Q_i^{\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}}}(s, \mathbf{a}) \geq Q_i^{\pi^i_{\mathrm{old}}, \pi^{-i}_{\mathrm{old}}}(s, \mathbf{a})$ for all $(s, \mathbf{a}) \in \mathcal{S} \times \mathcal{A}$.
$$(\pi^i_{k+1}, q_{k+1}) = \arg\max_{\pi^i, q}\ \mathbb{E}_{(a^i, a^{-i}) \sim (\pi^i, \pi_k^{-i})}\bigg[ Q_i^{\boldsymbol{\pi}_k}(s, \mathbf{a}) - \beta \log \pi^i(a^i|s) \qquad (35)$$
$$\qquad\qquad + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s) \bigg] \qquad (36)$$

Proof.
Let $(\pi^i_{\mathrm{new}}, q_{\mathrm{new}})$ be determined as
$$(\pi^i_{\mathrm{new}}, q_{\mathrm{new}}) = \arg\max_{\pi^i, q}\ \mathbb{E}_{(a_t^i, a_t^{-i}) \sim (\pi^i, \pi^{-i}_{\mathrm{old}})}\bigg[ Q_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_t, \mathbf{a}_t) - \beta \log \pi^i(a_t^i|s_t) \qquad (37)$$
$$\qquad\qquad + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t) \bigg]. \qquad (38)$$
Then, the following inequality holds:
$$\mathbb{E}_{(a_t^i, a_t^{-i}) \sim (\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}})}\left[ Q_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_t, \mathbf{a}_t) - \beta \log \pi^i_{\mathrm{new}}(a_t^i|s_t) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\mathrm{new}}(a_t^i, a_t^j, s_t) \right] \qquad (39)$$
$$\geq \mathbb{E}_{(a_t^i, a_t^{-i}) \sim (\pi^i_{\mathrm{old}}, \pi^{-i}_{\mathrm{old}})}\left[ Q_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_t, \mathbf{a}_t) - \beta \log \pi^i_{\mathrm{old}}(a_t^i|s_t) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\mathrm{old}}(a_t^i, a_t^j, s_t) \right] \qquad (40)$$
$$= V_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_t). \qquad (41)$$
From the definition of the Bellman operator,
$$Q_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_t, \mathbf{a}_t) = r(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\big[ V_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_{t+1}) \big] \qquad (42)$$
$$\leq r(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\, \mathbb{E}_{(a_{t+1}^i, a_{t+1}^{-i}) \sim (\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}})}\left[ Q_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_{t+1}, \mathbf{a}_{t+1}) - \beta \log \pi^i_{\mathrm{new}}(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\mathrm{new}}(a_{t+1}^i, a_{t+1}^j, s_{t+1}) \right] \qquad (43)$$
$$\leq r(s_t, \mathbf{a}_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\, \mathbb{E}_{(a_{t+1}^i, a_{t+1}^{-i}) \sim (\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}})}\left[ r(s_{t+1}, \mathbf{a}_{t+1}) - \beta \log \pi^i_{\mathrm{new}}(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}_{\mathrm{new}}(a_{t+1}^i, a_{t+1}^j, s_{t+1}) + \gamma V_i^{\boldsymbol{\pi}_{\mathrm{old}}}(s_{t+2}) \right] \qquad (44)$$
$$\vdots$$
$$\leq Q_i^{\pi^i_{\mathrm{new}}, \pi^{-i}_{\mathrm{old}}}(s_t, \mathbf{a}_t). \qquad (45)$$

Appendix C: Pseudocode
Algorithm 1: VM3-AC (L = 1)
Centralized training phase
Initialize parameters $\phi^i, \theta^i, \psi^i, \bar{\psi}^i, \xi^i$ for all $i \in \{1, \cdots, N\}$
for episode = 1, 2, ... do
    Initialize the state $s_0$; each agent observes $o_0^i$
    for $t < T$ and $s_t \neq$ terminal do
        Generate $z_t \sim \mathcal{N}(0, I)$ and select action $a_t^i \sim \pi^i(\cdot|o_t^i, z_t)$ for each agent $i$
        Execute $\mathbf{a}_t$; each agent $i$ receives $r_t$ and $o_{t+1}^i$
        Store the transition in $D$
    end for
    for each gradient step do
        Sample a minibatch from $D$ and generate $z_l \sim \mathcal{N}(0, I)$ for each transition
        Update $\psi^i$ and $\theta^i$ by minimizing the losses (14) and (15), respectively
        Update $\phi^i$ and $\xi^i$ by minimizing the loss (17)
    end for
    Update $\bar{\psi}^i$ using the exponential moving average method
end for

Decentralized execution phase
Initialize the state $s_0$; each agent observes $o_0^i$
for each environment step do
    Select action $a_t^i \sim \pi^i(\cdot|o_t^i, z_t)$, where $z_t \sim \mathcal{N}(0, I)$ is drawn from each agent's local random sequence generator initialized with the common seed
    Execute $\mathbf{a}_t$; each agent $i$ receives $o_{t+1}^i$
end for

Appendix D: Environment Details

Multi-walker
The multi-walker environment, introduced in [6], is a multi-agent modification of the BipedalWalker environment in OpenAI Gym. The environment consists of $N$ bipedal walkers and a large package. The goal is to move forward together while holding the large package on top of the walkers. The observation of each agent consists of the joint angular speeds, the positions of the joints, and so on. Each agent has a 4-dimensional continuous action that controls the torques of its legs. Each agent receives a shared reward depending on the distance over which the package has moved and receives a negative local penalty if the agent drops the package or falls to the ground. An episode ends when one of the agents falls, the package is dropped, or $T$ time steps elapse. To obtain higher rewards, the agents should learn coordinated behavior; for example, if one agent only tries to learn to move forward, ignoring the other agents, then the other agents may fall. In addition, different coordinated behavior is required as the number of agents changes. We set $T = 500$, a fixed negative local penalty for dropping the package or falling, and a shared reward of $10d$, where $d$ is the distance over which the package has moved. We simulated this environment in three cases by changing the number of agents ($N = 2$, $N = 3$, and $N = 4$).
[Figure 5: Considered environments: (a) multi-walker, (b) predator-prey, and (c) cooperative navigation.]
Predator-prey
The predator-prey environment, a standard task for MARL, consists of $N$ predators and $M$ prey. We used a variant of the predator-prey environment in the continuous domain. The initial positions of the predators are determined randomly, and those of the prey form a square lattice, as shown in Figure 5(b). The goal is to capture as many prey as possible within a given time $T$. A prey is captured when $C$ predators catch it simultaneously. The predators get a team reward $R$ when they catch a prey. After all of the prey are captured and removed, the prey respawn in the same positions and the team reward is doubled. Thus, different coordinated behavior is needed as $N$ and $C$ change. The observation of each agent consists of the relative positions between the agent and the other agents and between the agent and the prey; thus, each agent can access all information of the environment state. The action of each agent is a two-dimensional physical action. We set $R = 10$ and $T = 100$. We simulated the environment in three cases: $(N = 2, M = 16, C = 2)$, $(N = 3, M = 16, C = 1)$, and $(N = 4, M = 16, C = 2)$.

Cooperative navigation
Cooperative navigation, proposed in [16], consists of $N$ agents and $L$ landmarks. The goal of this environment is to occupy all landmarks while avoiding collisions with other agents. The agents receive a shared reward given by the sum of the minimum distances of the landmarks from any agent, agents that collide with each other receive a negative reward, and all agents receive an additional reward if all landmarks are occupied. The observation of each agent consists of the locations of all other agents and landmarks, and the action is a two-dimensional physical action. We set the two reward constants to 10 and 1, and $T = 50$. We simulated the environment in the case of $(N = 3, L = 3)$.

Appendix E: Hyperparameters and Training Details

The hyperparameters for MA-AC, I-SAC, MA-SAC, MADDPG, and VM3-AC are summarized in Table 1.

Table 1: Hyperparameters of all algorithms
(Values are common to MA-AC, I-SAC, MA-SAC, MADDPG, and VM3-AC. Cells left blank are not legible in this copy.)

Hyperparameter                              Value
Replay buffer size
Discount factor
Mini-batch size                             128
Optimizer                                   Adam
Learning rate
Target smoothing coefficient
Number of hidden layers (all networks)      2
Number of hidden units per layer            128
Activation function for hidden layers       ReLU
Activation function for final layer         Tanh
Table 2: The temperature parameter β for I-SAC, MA-SAC, and VM3-AC on the considered environments. Note that the temperature parameter β in I-SAC and MA-SAC controls the relative importance between the reward and the entropy, whereas the temperature parameter β in VM3-AC controls the relative importance between the reward and the mutual information.

            I-SAC   MA-SAC   VM3-AC
MW (N=3)    0.05    0.05     0.05
MW (N=4)    0.1     0.1      0.1
PP (N=2)    0.05    0.05     0.05
PP (N=3)    0.1     0.1      0.1
PP (N=4)    0.05    0.05     0.05
CN (N=3)    0.1     0.1      0.1
Table 3: The dimension of the latent variable z in VM3-AC

            VM3-AC
MW (N=3)    8
MW (N=4)    8
PP (N=2)    4
PP (N=3)    2
PP (N=4)    4
CN (N=3)    8

Appendix F: Comparison against MAVEN

[Figure 6: Comparison against MAVEN: (a) multi-walker (N=3), (b) multi-walker (N=4), (c) predator-prey (N=2), (d) predator-prey (N=3).]

We compared the proposed VM3-AC algorithm with a recent algorithm, MAVEN [17]. Since MAVEN is designed for discrete action spaces, for comparison we applied the idea of MAVEN to the actor-critic method to devise a continuous-action version, and we compared VM3-AC with this continuous-action version of MAVEN. The result is shown in Fig. 6. It is seen that VM3-AC outperforms the continuous-action version of MAVEN. As seen in Figure 6, the performance gain of the proposed method over MAVEN is noticeable, including in the case of predator-prey with N = 2.