DFAC Framework: Factorizing the Value Function via Quantile Mixture for Multi-Agent Distributional Q-Learning
Wei-Fang Sun
Cheng-Kuang Lee    Chun-Yi Lee

Abstract
In fully cooperative multi-agent reinforcement learning (MARL) settings, the environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of the other agents. To address the above issues, we integrate distributional RL and value function factorization methods by proposing a Distributional Value Function Factorization (DFAC) framework to generalize expected value function factorization methods to their DFAC variants. DFAC extends the individual utility functions from deterministic variables to random variables, and models the quantile function of the total return as a quantile mixture. To validate DFAC, we demonstrate DFAC's ability to factorize a simple two-step matrix game with stochastic rewards and perform experiments on all Super Hard tasks of the StarCraft Multi-Agent Challenge, showing that DFAC is able to outperform expected value function factorization baselines.
1. Introduction
In multi-agent reinforcement learning (MARL), one of the popular research directions is to enhance the training procedure of fully cooperative and decentralized agents. Examples of such agents include a fleet of unmanned aerial vehicles (UAVs), a group of autonomous cars, etc. This research direction aims to develop a decentralized and cooperative behavior policy for each agent, and is especially difficult for MARL settings without an explicit communication channel. The most straightforward approach is independent Q-learning (IQL) (Tan, 1993), where each agent is trained independently, with its behavior policy aimed to optimize the overall rewards in each episode. Nevertheless, each agent's policy may not converge owing to two main difficulties: (1) non-stationary environments caused by the changing behaviors of the agents, and (2) spurious reward signals originating from the actions of the other agents. The agents' partial observability of the environment further exacerbates the above issues.

Therefore, in the past few years, a number of MARL researchers turned their attention to centralized training with decentralized execution (CTDE) approaches, with an objective to stabilize the training procedure while maintaining the agents' abilities for decentralized execution (Oliehoek et al., 2016). Among these CTDE approaches, value function factorization methods (Sunehag et al., 2018; Rashid et al., 2020; Son et al., 2019) are especially promising in terms of their superior performances and data efficiency (Samvelyan et al., 2019).

Value function factorization methods introduce the assumption of individual-global-max (IGM) (Son et al., 2019), which assumes that each agent's optimal actions result in the optimal joint actions of the entire group. Based on IGM, the total return of a group of agents can be factorized into separate utility functions (Guestrin et al., 2001) (or simply 'utility' hereafter) for each agent. The utilities allow the agents to independently derive their own optimal actions during execution, and deliver promising performance in the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019). Unfortunately, current value function factorization methods only concentrate on estimating the expectations of the utilities, overlooking the additional information contained in the full return distributions. Such information, nevertheless, has been demonstrated beneficial for policy learning in the recent literature (Lyle et al., 2019).

In the past few years, distributional RL has been empirically shown to enhance value function estimation in various single-agent RL (SARL) domains (Bellemare et al., 2017; Dabney et al., 2018b;a; Rowland et al., 2019; Yang et al., 2019). Instead of estimating a single scalar Q-value, it approximates the probability distribution of the return by either a categorical distribution (Bellemare et al., 2017) or a quantile function (Dabney et al., 2018b;a). Even though the above methods may be beneficial to the MARL domain due to their ability to capture uncertainty, they are inherently incompatible with expected value function factorization methods (e.g., value decomposition network (VDN) (Sunehag et al., 2018) and QMIX (Rashid et al., 2020)). The incompatibility arises from two aspects: (1) maintaining IGM in a distributional form, and (2) factorizing the probability distribution of the total return into individual utilities. As a result, an effective and efficient approach that is able to solve the incompatibility is crucial and necessary for bridging the gap between value function factorization methods and distributional RL.

In this paper, we propose a Distributional Value Function Factorization (DFAC) framework to efficiently integrate value function factorization methods with distributional RL. DFAC solves the incompatibility with two techniques: (1) Mean-Shape Decomposition and (2) Quantile Mixture. The former allows the generalization of expected value function factorization methods (e.g., VDN and QMIX) to their DFAC variants without violating IGM. The latter allows the total return distribution to be factorized into individual utility distributions in a computationally efficient manner. To validate the effectiveness of DFAC, we first demonstrate the ability of distribution factorization on a two-step matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps in SMAC. The experimental results show that DFAC offers beneficial impacts on the baseline methods in all Super Hard maps. In summary, the primary contribution is the introduction of DFAC for bridging the gap between distributional RL and value function factorization methods efficiently by mean-shape decomposition and quantile mixture.

Department of Computer Science, National Tsing Hua University, Taiwan; NVIDIA AI Technology Center, NVIDIA Corporation. Wei-Fang Sun contributed to the work during his NVIDIA internship. Correspondence to: Chun-Yi Lee <[email protected]>. Copyright 2021 by the author(s).
2. Background and Related Works
In this section, we introduce the essential background material for understanding the contents of this paper. We first define the problem formulation of cooperative MARL and CTDE. Next, we describe the conventional formulation of IGM and the value function factorization methods. Then, we walk through the concepts of distributional RL, quantile function, as well as quantile regression, which are the fundamental concepts frequently mentioned in this paper. After that, we explain the implicit quantile network, a key approach adopted in this paper for approximating quantiles. Finally, we bring out the concept of quantile mixture, which is leveraged by DFAC for factorizing the return distribution.
In this work, we consider a fully cooperative MARL environment modeled as a decentralized and partially observable Markov Decision Process (Dec-POMDP) (Oliehoek & Amato, 2016) with stochastic rewards, which is described as a tuple ⟨S, K, O_jt, U_jt, P, O, R, γ⟩ and is defined as follows:

• S is the finite set of global states in the environment, where s' ∈ S denotes the next state of the current state s ∈ S. The state information is optionally available during training, but not available to the agents during execution.
• K = {1, ..., K} is the set of K agents. We use k ∈ K to denote the index of the agent.
• O_jt = Π_{k∈K} O_k is the set of joint observations. At each timestep, a joint observation o = ⟨o_1, ..., o_K⟩ ∈ O_jt is received. Each agent k is only able to observe its individual observation o_k ∈ O_k.
• H_jt = Π_{k∈K} H_k is the set of joint observation histories. The joint observation history h = ⟨h_1, ..., h_K⟩ ∈ H_jt concatenates all received observations and performed actions before a certain timestep, where h_k ∈ H_k represents the observation history of agent k.
• U_jt = Π_{k∈K} U_k is the set of joint actions. At each timestep, the entire group of agents takes a joint action u = ⟨u_1, ..., u_K⟩ ∈ U_jt. The individual action u_k ∈ U_k of each agent k is determined based on its stochastic policy π_k(u_k | h_k): H_k × U_k → [0, 1], expressed as u_k ∼ π_k(·| h_k). Similarly, in single-agent scenarios, we use u and u' to denote the actions of the agent at states s and s' under policy π, respectively.
• T = {1, ..., T} represents the set of timesteps with horizon T, where the index of the current timestep is denoted as t ∈ T. s_t, o_t, h_t, and u_t correspond to the environment information at timestep t.
• The transition function P(s' | s, u): S × U_jt × S → [0, 1] specifies the state transition probabilities. Given s and u, the next state is denoted as s' ∼ P(·| s, u).
• The observation function O(o | s): O_jt × S → [0, 1] specifies the joint observation probabilities. Given s, the joint observation is represented as o ∼ O(·| s).
• R(r | s, u): S × U_jt × ℝ → [0, 1] is the joint reward function shared among all agents. Given s and u, the team reward is expressed as r ∼ R(·| s, u).
• γ ∈ ℝ is the discount factor with value within (0, 1].

Under such an MARL formulation, this work concentrates on CTDE value function factorization methods, where the agents are trained in a centralized fashion and executed in a decentralized manner. In other words, the joint observation history h is available during the learning processes of the individual policies [π_k]_{k∈K}. During execution, each agent's policy π_k only conditions on its observation history h_k.

IGM is necessary for value function factorization (Son et al., 2019). For a joint action-value function Q_jt(h, u): H_jt × U_jt → ℝ, if there exist K individual utility functions [Q_k(h_k, u_k): H_k × U_k → ℝ]_{k∈K} such that the following condition holds:

arg max_u Q_jt(h, u) = ⟨ arg max_{u_1} Q_1(h_1, u_1), ..., arg max_{u_K} Q_K(h_K, u_K) ⟩,   (1)

then [Q_k]_{k∈K} are said to satisfy IGM for Q_jt under h. In this case, we also say that Q_jt(h, u) is factorized by [Q_k(h_k, u_k)]_{k∈K} (Son et al., 2019).
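To make the IGM condition concrete, the following minimal NumPy sketch (not part of the original paper; the utility tables are random placeholders) checks that, under an additive VDN-style factorization, the per-agent greedy actions coincide with the centralized greedy joint action:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_actions = 3, 4                       # hypothetical: 3 agents, 4 actions each
# Per-agent utilities Q_k(h_k, u_k) for a fixed observation history h.
Q_k = rng.normal(size=(K, n_actions))

# Additive joint value: Q_jt(h, u) = sum_k Q_k(h_k, u_k).
# Enumerate all joint actions to find the centralized greedy joint action.
joint_actions = np.array(np.meshgrid(*[range(n_actions)] * K)).T.reshape(-1, K)
q_jt = np.array([Q_k[np.arange(K), u].sum() for u in joint_actions])
central_greedy = joint_actions[q_jt.argmax()]

# Decentralized greedy actions: each agent maximizes its own utility.
decentralized_greedy = Q_k.argmax(axis=1)

# IGM: the per-agent argmaxes coincide with the joint argmax.
assert np.array_equal(central_greedy, decentralized_greedy)
print(central_greedy)
```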
If Q_jt in a given task is factorizable under all h ∈ H_jt, we say that the task is factorizable. Intuitively, factorizable tasks indicate that there exists a factorization such that each agent can select the greedy action according to its individual utility [Q_k]_{k∈K} independently in a decentralized fashion. This enables the optimal individual actions to implicitly achieve the optimal joint action across the K agents. Since there is no individual reward, the factorized utilities do not estimate expected returns on their own (Guestrin et al., 2001) and are different from the value function definition commonly used in SARL.

Based on IGM, value function factorization methods enable centralized training for factorizable tasks, while maintaining the ability for decentralized execution. In this work, we consider two such methods, VDN and QMIX, which can solve the subsets of factorizable tasks that satisfy Additivity (Eq. (2)) and Monotonicity (Eq. (3)), respectively, given by:

Q_jt(h, u) = Σ_{k=1}^{K} Q_k(h_k, u_k),   (2)

Q_jt(h, u) = M(Q_1(h_1, u_1), ..., Q_K(h_K, u_K) | s),   (3)

where M is a monotonic function that satisfies ∂M/∂Q_k ≥ 0, ∀k ∈ K, and conditions on the state s if the information is available during training. Either of these two equations is a sufficient condition for IGM (Son et al., 2019).

For notational simplicity, we consider a degenerate case with only a single agent, and the environment is fully observable until the end of Section 2.6. Distributional RL generalizes classic expected RL methods by capturing the full return distribution Z(s, u) instead of the expected return Q(s, u), and outperforms expected RL methods in various single-agent RL domains (Bellemare et al., 2017; 2019; Dabney et al., 2018b;a; Rowland et al., 2019; Yang et al., 2019). Moreover, distributional RL enables improvements (Nikolov et al., 2019; Zhang & Yao, 2019; Mavrin et al., 2019) that require the information of the full return distribution. We define the distributional Bellman operator T^π as follows:

T^π Z(s, u) D= R(s, u) + γ Z(s', u'),   (4)

and the distributional Bellman optimality operator T* as:

T* Z(s, u) D= R(s, u) + γ Z(s', u'*),   (5)

where u'* = arg max_{u'} E[Z(s', u')] is the optimal action at state s', and the expression X D= Y denotes that random variables X and Y follow the same distribution. Given some initial distribution Z_0, Z converges to the return distribution Z^π under π, contracting in terms of the p-Wasserstein distance for all p ∈ [1, ∞), by repeatedly applying T^π; while Z alternates between the optimal return distributions in the set Z* := {Z^{π*} : π* ∈ Π*}, under the set of optimal policies Π*, by repeatedly applying T* (Bellemare et al., 2017). The p-Wasserstein distance W_p between the probability distributions of random variables X, Y is given by:

W_p(X, Y) = ( ∫_0^1 |F⁻¹_X(τ) − F⁻¹_Y(τ)|^p dτ )^{1/p},   (6)

where (F⁻¹_X, F⁻¹_Y) are the quantile functions of (X, Y).

The relationship between the cumulative distribution function (CDF) F_X and the quantile function F⁻¹_X (the generalized inverse CDF) of a random variable X is formulated as:

F⁻¹_X(τ) = inf{x ∈ ℝ : τ ≤ F_X(x)}, ∀τ ∈ [0, 1].   (7)

The expectation of X expressed in terms of F⁻¹_X(τ) is:

E[X] = ∫_0^1 F⁻¹_X(τ) dτ.   (8)

In (Dabney et al., 2018b), the authors model the value function as a quantile function F⁻¹(s, u | τ). During optimization, a pair-wise sampled temporal difference (TD) error δ for two quantile samples τ, τ' ∼ U([0, 1]) is defined as:

δ^{τ,τ'}_t = r + γ F⁻¹(s', u' | τ') − F⁻¹(s, u | τ).   (9)

The pair-wise loss ρ^κ_τ is then defined based on the Huber quantile regression loss L_κ (Dabney et al., 2018b) with threshold κ = 1, and is formulated as follows:

ρ^κ_τ(δ^{τ,τ'}) = |τ − 𝕀{δ^{τ,τ'} < 0}| · L_κ(δ^{τ,τ'}) / κ, with   (10)

L_κ(δ^{τ,τ'}) = (1/2)(δ^{τ,τ'})² if |δ^{τ,τ'}| ≤ κ, and κ(|δ^{τ,τ'}| − κ/2) otherwise.   (11)

Given N quantile samples [τ_i]_{i=1}^{N} to be optimized with regard to N' target quantile samples [τ'_j]_{j=1}^{N'}, the total loss L(s, u, r, s') is defined as the sum of the pair-wise losses, and is expressed as the following:

L(s, u, r, s') = (1/N') Σ_{i=1}^{N} Σ_{j=1}^{N'} ρ^κ_{τ_i}(δ^{τ_i, τ'_j}).   (12)
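As an illustration of Eqs. (9)-(12), the following PyTorch sketch (written for this text, not taken from the paper; tensor shapes and argument names are our own choices) computes the pair-wise Huber quantile regression loss for a batch of predicted and target quantile samples:

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """Pair-wise quantile regression loss of Eqs. (9)-(12).

    pred_quantiles:   (batch, N)  values F^{-1}(s, u | tau_i) at the sampled taus
    target_quantiles: (batch, N') targets r + gamma * F^{-1}(s', u'* | tau'_j)
    taus:             (batch, N)  quantile fractions used for the predictions
    """
    # Pair-wise TD errors delta_{ij}: shape (batch, N, N')
    delta = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    abs_delta = delta.abs()
    # Huber loss L_kappa(delta), Eq. (11)
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (abs_delta - 0.5 * kappa))
    # Asymmetric weighting |tau_i - 1{delta < 0}|, Eq. (10)
    weight = (taus.unsqueeze(2) - (delta.detach() < 0).float()).abs()
    # (1/N') sum_i sum_j rho, Eq. (12), then averaged over the batch
    loss = (weight * huber / kappa).mean(dim=2).sum(dim=1)
    return loss.mean()
```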
Implicit quantile network (IQN) (Dabney et al., 2018a) is relatively light-weight when compared to other distributional RL methods. It approximates the return distribution Z(s, u) by an implicit quantile function F⁻¹(s, u | τ) = g(ψ(s), φ(τ))_u for some differentiable functions g, ψ, and φ. Such an architecture is a type of universal value function approximator (UVFA) (Schaul et al., 2015), which generalizes its predictions across states s ∈ S and goals τ ∈ [0, 1], with the goals defined as different quantiles of the return distribution. In practice, φ first expands the scalar τ to an n-dimensional vector by [cos(πiτ)]_{i=0}^{n−1}, followed by a single hidden layer with weights [w_ij] and biases [b_j] to produce a quantile embedding φ(τ) = [φ(τ)_j]_{j=0}^{dim(φ(τ))−1}. The expression of φ(τ)_j can be represented as the following:

φ(τ)_j := ReLU( Σ_{i=0}^{n−1} cos(πiτ) w_ij + b_j ),   (13)

where n = 64 and dim(φ(τ)) = dim(ψ(s)). Then, φ(τ) is combined with the state embedding ψ(s) by the element-wise (Hadamard) product (⊙), expressed as g := ψ ⊙ φ. The loss of IQN is defined as Eq. (12) by sampling a batch of N and N' quantiles from the policy network and the target network, respectively. During execution, the action with the largest expected return Q(s, u) is chosen. Since IQN does not model the expected return explicitly, Q(s, u) is approximated by calculating the mean of the sampled returns through N̂ quantile samples τ̂_i ∼ U([0, 1]), ∀i ∈ [1, N̂], based on Eq. (8), expressed as follows:

Q(s, u) = ∫_0^1 F⁻¹(s, u | τ) dτ ≈ (1/N̂) Σ_{i=1}^{N̂} F⁻¹(s, u | τ̂_i).   (14)

Multiple quantile functions (e.g., IQNs) sharing the same quantile τ may be combined into a single quantile function F⁻¹(τ), in the form of a quantile mixture expressed as follows:

F⁻¹(τ) = Σ_{k=1}^{K} β_k F⁻¹_k(τ),   (15)

where [F⁻¹_k(τ)]_{k∈K} are quantile functions, and [β_k]_{k∈K} are model parameters (Karvanen, 2006). The condition for [β_k]_{k∈K} is that F⁻¹(τ) must satisfy the properties of a quantile function. The concept of quantile mixture is analogous to the mixture of multiple probability density functions (PDFs), expressed as follows:

f(x) = Σ_{k=1}^{K} α_k f_k(x),   (16)

where [f_k(x)]_{k∈K} are PDFs, Σ_{k=1}^{K} α_k = 1, and α_k ≥ 0.
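A minimal PyTorch sketch of the cosine quantile embedding of Eq. (13) is given below; the module and argument names are hypothetical, and the hidden-layer width follows the n = 64 setting stated above:

```python
import torch
import torch.nn as nn

class QuantileEmbedding(nn.Module):
    """Cosine embedding phi(tau) of Eq. (13), with n = 64 basis frequencies."""
    def __init__(self, embed_dim, n_cos=64):
        super().__init__()
        # i * pi for i = 0, ..., n-1, stored as a non-trainable buffer
        self.register_buffer("i_pi", torch.arange(n_cos).float() * torch.pi)
        self.fc = nn.Linear(n_cos, embed_dim)

    def forward(self, taus):                # taus: (batch, N)
        # cos(pi * i * tau)  ->  (batch, N, n_cos)
        cos = torch.cos(taus.unsqueeze(-1) * self.i_pi)
        return torch.relu(self.fc(cos))     # phi(tau): (batch, N, embed_dim)

# Combining with a state embedding psi(s) via the Hadamard product (g = psi ⊙ phi):
# one return sample per tau would be produced by a head applied to
# psi(s).unsqueeze(1) * phi(taus), with embed_dim chosen to match dim(psi(s)).
```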
3. Methodology
In this section, we walk through the proposed DFAC framework and its derivation procedure. We first discuss a naive distributional factorization and its limitation in Section 3.1. Then, we introduce the DFAC framework to address the limitation, and show that DFAC is able to generalize distributional RL to all factorizable tasks in Section 3.2. After that, a practical implementation of DFAC based on quantile mixture is presented in Section 3.3. Finally, DDN and DMIX are introduced as the DFAC variants of VDN and QMIX, respectively, in Section 3.4.
Since IGM is necessary for value function factorization, a distributional factorization that satisfies IGM is required for factorizing stochastic value functions. We first discuss a naive distributional factorization that simply replaces deterministic utilities Q with stochastic utilities Z. Then, we provide a theorem to show that the naive distributional factorization is insufficient to guarantee the IGM condition.

Definition 1 (Distributional IGM). A finite number of individual stochastic utilities [Z_k(h_k, u_k)]_{k∈K} are said to satisfy Distributional IGM (DIGM) for a stochastic joint action-value function Z_jt(h, u) under h, if [E[Z_k(h_k, u_k)]]_{k∈K} satisfy IGM for E[Z_jt(h, u)] under h, represented as:

arg max_u E[Z_jt(h, u)] = ⟨ arg max_{u_1} E[Z_1(h_1, u_1)], ..., arg max_{u_K} E[Z_K(h_K, u_K)] ⟩.

Theorem 1.
Given a deterministic joint action-value function Q_jt, a stochastic joint action-value function Z_jt, and a factorization function Ψ for deterministic utilities:

Q_jt(h, u) = Ψ(Q_1(h_1, u_1), ..., Q_K(h_K, u_K) | s),

such that [Q_k]_{k∈K} satisfy IGM for Q_jt under h, the following distributional factorization:

Z_jt(h, u) = Ψ(Z_1(h_1, u_1), ..., Z_K(h_K, u_K) | s)

is insufficient to guarantee that [Z_k]_{k∈K} satisfy DIGM for Z_jt under h.

In order to satisfy DIGM for stochastic utilities, an alternative factorization strategy is necessary. We propose Mean-Shape Decomposition and the DFAC framework to ensure that DIGM is satisfied.
Definition 2 (Mean-Shape Decomposition). A given random variable Z can be decomposed as follows:

Z = E[Z] + (Z − E[Z]) = Z_mean + Z_shape,

where Var(Z_mean) = 0 and E[Z_shape] = 0.

We propose DFAC to decompose a joint return distribution Z_jt into its deterministic part Z_mean (i.e., expected value) and stochastic part Z_shape (i.e., higher moments), which are approximated by two different functions Ψ and Φ, respectively. The factorization function Ψ is required to precisely factorize the expectation of Z_jt in order to satisfy DIGM. On the other hand, the shape function Φ is allowed to roughly factorize the shape of Z_jt, since the main objective of modeling the return distribution is to assist non-linear approximation of the expectation of Z_jt (Lyle et al., 2019), rather than accurately model the shape of Z_jt.

Theorem 2 (DFAC Theorem). Given a deterministic joint action-value function Q_jt, a stochastic joint action-value function Z_jt, and a factorization function Ψ for deterministic utilities:

Q_jt(h, u) = Ψ(Q_1(h_1, u_1), ..., Q_K(h_K, u_K) | s),

such that [Q_k]_{k∈K} satisfy IGM for Q_jt under h, by Mean-Shape Decomposition, the following distributional factorization:

Z_jt(h, u) = E[Z_jt(h, u)] + (Z_jt(h, u) − E[Z_jt(h, u)])
           = Z_mean(h, u) + Z_shape(h, u)
           = Ψ(Q_1(h_1, u_1), ..., Q_K(h_K, u_K) | s) + Φ(Z_1(h_1, u_1), ..., Z_K(h_K, u_K) | s)

is sufficient to guarantee that [Z_k]_{k∈K} satisfy DIGM for Z_jt under h, where Var(Ψ) = 0 and E[Φ] = 0.

Theorem 2 reveals that the choice of Ψ determines whether IGM holds, regardless of the choice of Φ, as long as E[Φ] = 0. Under this setting, any differentiable factorization function of deterministic variables can be extended to a factorization function of random variables. Such a decomposition enables the approximation of joint distributions for all factorizable tasks under appropriate choices of Ψ and Φ.

In this section, we provide a practical implementation of the shape function Φ in DFAC, effectively extending any differentiable factorization function Ψ (e.g., the additive function of VDN, the monotonic mixing network of QMIX, etc.) that satisfies the IGM condition into its DFAC variant. Theoretically, the sum of random variables appearing in DDN and DMIX can be described precisely by a joint CDF. However, the exact derivation of this joint CDF is usually computationally expensive and impractical (Lin et al., 2019). As a result, DFAC utilizes the property of quantile mixture to approximate the shape function Φ in O(KN) time.
Theorem 3.
Given a quantile mixture:

F⁻¹(τ) = Σ_{k=1}^{K} β_k F⁻¹_k(τ)

with K components [F⁻¹_k]_{k∈K} and non-negative model parameters [β_k]_{k∈K}, there exist a set of random variables Z and [Z_k]_{k∈K} corresponding to the quantile functions F⁻¹ and [F⁻¹_k]_{k∈K}, respectively, with the following relationship:

Z = Σ_{k∈K} β_k Z_k.

Based on Theorem 3, the quantile function F⁻¹ of Z_shape in DFAC can be approximated by the following:

F⁻¹(h, u | τ) = F⁻¹_state(s | τ) + Σ_{k∈K} β_k(s) (F⁻¹_k(h_k, u_k | τ) − Q_k(h_k, u_k)),   (17)

where F⁻¹_state(s | τ) and [β_k(s)]_{k∈K} are respectively generated by function approximators Λ_state(s | τ) and [Λ_k(s)]_{k∈K}, satisfying the constraints β_k(s) ≥ 0, ∀k ∈ K and ∫_0^1 F⁻¹_state(s | τ) dτ = 0. The term F⁻¹_state models the shape of an additional state-dependent utility (introduced by QMIX at the last layer of the mixing network), which extends the state-dependent bias in QMIX to a full distribution. The full network architecture of DFAC is illustrated in Fig. 1. This transformation enables DFAC to decompose the quantile representation of a joint distribution into the quantile representations of individual utilities. In this work, Φ is implemented by a large IQN composed of multiple IQNs, optimized through the loss function defined in Eq. (12).

In order to validate the proposed DFAC framework, we next discuss the DFAC variants of two representative factorization methods: VDN and QMIX. DDN extends VDN to its DFAC variant, expressed as:

Z_jt = Σ_{k∈K} Q_k + Σ_{k∈K} (Z_k − Q_k),   (18)

given Z_mean = Σ_{k∈K} Q_k and Z_shape = Σ_{k∈K} (Z_k − Q_k); while DMIX extends QMIX to its DFAC variant, expressed as:

Z_jt = M(Q_1, ..., Q_K | s) + Σ_{k∈K} (Z_k − Q_k),   (19)

given Z_mean = M(Q_1, ..., Q_K | s) and Z_shape = Σ_{k∈K} (Z_k − Q_k). Both DDN and DMIX choose F⁻¹_state = 0 and [β_k = 1]_{k∈K} for simplicity. DMIX conditions on s, while DDN does not.
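The sketch below (an illustration written for this text, not the authors' released implementation) shows how the DDN factorization of Eq. (18) can be computed from per-agent quantile samples evaluated at shared quantile fractions, with β_k = 1 and F⁻¹_state = 0 as stated above:

```python
import torch

def ddn_joint_quantiles(agent_quantiles):
    """Minimal DDN-style factorization of Eq. (18).

    agent_quantiles: (batch, K, N) samples of F_k^{-1}(h_k, u_k | tau_i) for the
                     chosen actions, where all agents share the same tau_i values.
    Returns the joint quantile samples Z_jt(tau_i) = Z_mean + Z_shape with
    beta_k = 1 and F_state^{-1} = 0.
    """
    # Q_k(h_k, u_k) = E[Z_k], approximated by the mean over quantile samples (Eq. (14)).
    q_k = agent_quantiles.mean(dim=2, keepdim=True)       # (batch, K, 1)
    z_mean = q_k.sum(dim=1)                                # (batch, 1): sum_k Q_k
    z_shape = (agent_quantiles - q_k).sum(dim=1)           # (batch, N): sum_k (Z_k - Q_k)
    return z_mean + z_shape                                # (batch, N) joint quantiles

# For DMIX, z_mean would instead come from a monotonic mixing network
# M(Q_1, ..., Q_K | s), while z_shape stays the same additive quantile mixture.
```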
Figure 1: The DFAC framework consists of a factorization network Ψ and a shape network Φ for decomposing the deterministic part Z_mean (i.e., Q_jt) and the stochastic part Z_shape of the total return distribution Z_jt, as described in Theorem 2. Each of the K agents produces its utility from an MLP-GRU-MLP network conditioned on a quantile sample τ. The shape network contains parameter networks Λ_state(s | τ) and [Λ_k(s)]_{k∈K} for generating Z_state(s) and β_k(s).
4. A Stochastic Two-Step Game
In the previous expected value function factorization methods (e.g., VDN, QMIX, etc.), the factorization is achieved by modeling Q_jt and [Q_k]_{k∈K} as deterministic variables, overlooking the information of higher moments in the full return distributions Z_jt and [Z_k]_{k∈K}. In order to demonstrate DFAC's ability of factorization, we begin with a toy example modified from (Rashid et al., 2020) to show that DFAC is able to approximate the true return distributions, and factorize the mean and variance of the approximated total return Z_jt into utilities [Z_k]_{k∈K}. Table 1 illustrates the flow of a two-step game consisting of two agents and three states: State 1, State 2A, and State 2B, where State 1 serves as the initial state, and each agent is able to perform an action from {A, B} in each step. In the first step (i.e., State 1), the action of agent 1 (i.e., action A_1 or B_1) determines which of the two matrix games (State 2A or State 2B) to play in the next step, regardless of the action performed by agent 2 (i.e., action A_2 or B_2). For all joint actions performed in the first step, no reward is provided to the agents. In the second step, both agents choose an action and receive a global reward according to the payoff matrices depicted in Table 1, where the global rewards are sampled from a normal distribution N(µ, σ²) with mean µ and standard deviation σ.

Table 2 presents the learned factorization of DMIX for each state after convergence, where the first rows and the first columns of the tables correspond to the factorized distributions of the individual utilities (i.e., Z_1 and Z_2), and the main content cells correspond to the joint return distributions (i.e., Z_jt). From Tables 2(b) and 2(c), it is observed that no matter whether the true returns are deterministic (i.e., State 2A) or stochastic (i.e., State 2B), DMIX is able to approximate the true returns in Table 1 properly, which is not achievable by expected value function factorization methods. The results demonstrate DFAC's ability to factorize the joint return distribution rather than expected returns.

To further illustrate DFAC's capability of factorization, Figs. 2(a)-2(b) visualize the factorization of the joint action ⟨B_1, B_2⟩ in State 2A and ⟨B_1, B_2⟩ in State 2B, respectively. As IQN approximates the utilities Z_1 and Z_2 implicitly, Z_1, Z_2, and Z_jt can only be plotted in terms of samples. Z_jt in Fig. 2(a) shows that DMIX degenerates to QMIX when approximating deterministic returns (i.e., N(7, 0)), while Z_jt in Fig. 2(b) exhibits DMIX's ability to capture the uncertainty in the stochastic return of State 2B.
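For concreteness, a minimal sketch of the two-step game dynamics is given below. The State 2A rewards are deterministic (N(7, 0)) and the State 2B means follow the payoff structure in Table 1; the State 2B standard deviations in this sketch are placeholders of our own, since only the means are stated explicitly in the text:

```python
import numpy as np

class TwoStepGame:
    """Sketch of the stochastic two-step matrix game of Section 4."""
    MEANS_2B = np.array([[0.0, 1.0], [1.0, 8.0]])   # means of the State 2B payoffs
    STDS_2B = np.array([[1.0, 1.0], [1.0, 1.0]])    # placeholder sigmas (assumption)

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = "1"

    def step(self, u1, u2):                 # actions: 0 = A, 1 = B
        if self.state == "1":
            # Agent 1's action alone selects the next matrix game; reward is 0.
            self.state = "2A" if u1 == 0 else "2B"
            return self.state, 0.0, False
        if self.state == "2A":
            reward = 7.0                    # deterministic, i.e., N(7, 0)
        else:
            reward = self.rng.normal(self.MEANS_2B[u1, u2], self.STDS_2B[u1, u2])
        return self.state, reward, True     # episode ends after the second step
```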
5. Experimental Results on SMAC
In this section, we present the experimental results and discuss their implications. We start with a brief introduction to our experimental setup in Section 5.1. Then, we demonstrate that modeling a full distribution is beneficial to the performance of independent learners in Section 5.2. Finally, we compare the performances of the CTDE baseline methods and their DFAC variants in Section 5.3.
We verify the DFAC framework in the SMAC benchmark environments (Samvelyan et al., 2019) built on the popular real-time strategy game StarCraft II.
Table 1: An illustration of the flow of the stochastic two-step game. Each agent is able to perform an action from {A, B} in each step, with a subscript denoting the agent index. In the first step, action A_1 takes the agents from the initial State 1 to State 2A, while action B_1 takes them to State 2B instead. The transitions from State 1 to State 2A or State 2B yield zero reward. In the second step, the global rewards are sampled from the normal distributions defined in the payoff matrices.

State 1: agent 1's action A_1 leads to State 2A, while B_1 leads to State 2B (zero reward for all joint actions).

State 2A (payoff) |   A_2    |   B_2
  A_1             | N(7, 0)  | N(7, 0)
  B_1             | N(7, 0)  | N(7, 0)

State 2B (payoff) |   A_2    |   B_2
  A_1             | N(0, ·)  | N(1, ·)
  B_1             | N(1, ·)  | N(8, ·)

Figure 2: (a) and (b) plot the value function factorization of the joint action ⟨B_1, B_2⟩ in State 2A and State 2B, respectively. The black line/curve (Z_GT) shows the true return CDF. The blue circles and the orange cross marks depict agent 1's and agent 2's learned utilities (Z_1 and Z_2), respectively, while the green squares indicate the estimated joint return (Z_jt).
Table 2: The learned factorization of DMIX. All of the cells show the sampled mean µ and the sampled variance σ with Bessel's correction. The main content cells correspond to the joint return distributions for different combinations of states and actions. The first columns and first rows of these tables correspond to the distributions of the utilities for agents 1 and 2, respectively. The top-left cells of these tables are the state-dependent utility V. DFAC enables the approximation of the true joint return distributions in Table 1, and allows them to be factorized into the distributions of the utilities for the agents.

(a) Learned utilities of State 1
  V:   µ = −., σ = 0.  | A_2: µ = 2., σ = 0.  | B_2: µ = 2., σ = 0.
  A_1: µ = 2., σ = 0.  |      µ = 6., σ = 0.  |      µ = 6., σ = 0.
  B_1: µ = 3., σ = 19. |      µ = 7., σ = 21. |      µ = 7., σ = 21.

(b) Learned utilities of State 2A
  V:   µ = 0., σ = 0.  | A_2: µ = 1., σ = 0.  | B_2: µ = 1., σ = 0.
  A_1: µ = 2., σ = 0.  |      µ = 7., σ = 0.  |      µ = 6., σ = 0.
  B_1: µ = 2., σ = 0.  |      µ = 7., σ = 0.  |      µ = 6., σ = 0.

(c) Learned utilities of State 2B
  V:   µ = 0., σ = 0.  | A_2: µ = −., σ = 0.  | B_2: µ = 3., σ = 5.
  A_1: µ = −., σ = 0.  |      µ = −., σ = 1.  |      µ = 1., σ = 9.
  B_1: µ = 3., σ = 6.  |      µ = 1., σ = 9.  |      µ = 8., σ = 25.

Instead of playing the full game, SMAC is developed for evaluating the effectiveness of MARL micro-management algorithms. Each environment in SMAC contains two teams. One team is controlled by a decentralized MARL algorithm, with the policies of the agents conditioned on their local observation histories. The other team consists of enemy units controlled by the built-in game artificial intelligence based on carefully handcrafted heuristics, which is set to its highest difficulty equal to seven. The overall objective is to maximize the win rate for each battle scenario, where the rewards employed in our experiments follow the default settings of SMAC. The default settings use shaped rewards
Super Hard scenarios (Samvelyan et al., 2019). In this paper, we fo-cus on all
Super Hard scenarios including (a)
6h vs 8z , (b) , (c)
MMM2 , (d)
27m vs 30m , and (e) corridor , since these scenarios have not been properlyaddressed in the previous literature without the use of addi-tional assumptions such as intrinsic reward signals (Du et al.,2019), explicit communication channels (Zhang et al., 2019;Wang et al., 2019), common knowledge shared among theagents (de Witt et al., 2019; Wang et al., 2020), and so on.Three of these scenarios have their maximum scores higherthan . In , the enemy Stalkers have theability to regenerate shields; in
MMM2 , the enemy
Medivacs can heal other units; in corridor , the enemy
Zerglings slowly regenerate their own health.
Hyperparameters.
For all of our experimental results, the training length is set to 8M timesteps, and the agents are evaluated every 40k timesteps with 32 independent runs. The curves presented in this section are generated based on five different random seeds. The solid lines represent the median win rate, while the shaded areas correspond to the 25th to 75th percentiles. For better visualization, the presented curves are smoothed by a moving average filter with its window size set to 11.
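The settings above can be summarized in a small configuration sketch; the key names below are hypothetical and only restate the values given in this paragraph and in Section 5.1:

```python
# Hypothetical key names; the values restate the stated experimental settings.
EXPERIMENT_CONFIG = {
    "total_env_steps": 8_000_000,     # training length
    "eval_interval_steps": 40_000,    # evaluation frequency
    "eval_episodes": 32,              # independent evaluation runs per checkpoint
    "random_seeds": 5,                # seeds per curve
    "smoothing_window": 11,           # moving-average filter for plotted curves
    "enemy_ai_difficulty": 7,         # SMAC built-in AI, highest difficulty
}
```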
Baselines.
We select IQL, VDN, and QMIX as our baseline methods, and compare them with their distributional variants in our experiments. The configurations are optimized so as to provide the best performance for each of the methods considered. Since we tuned the hyperparameters of the baselines, their performances are better than those reported in (Samvelyan et al., 2019).
Figure 3: The win rate curves evaluated on the five Super Hard maps in SMAC for different CTDE methods (IQL, VDN, QMIX, DIQL, DDN, and DMIX): (a) 6h_vs_8z, (b) 3s5z_vs_3s6z, (c) MMM2, (d) 27m_vs_30m, and (e) corridor. Each panel plots the median test win rate (%) against the training timestep.

Table 3: The median win rate % of five independent test runs on maps (a)-(e) for IQL, VDN, QMIX, DIQL, DDN, and DMIX.
Table 4: The averaged scores of five independent test runs on maps (a)-(e) for IQL, VDN, QMIX, DIQL, DDN, and DMIX.
In order to validate our assumption that distributional RL is beneficial to the MARL domain, we first employ the simplest training algorithm, IQL, and extend it to its distributional variant, called DIQL. DIQL is simply a modified IQL that uses IQN as its underlying RL algorithm without any additional modification or enhancements (Matignon et al., 2007; Lyu & Amato, 2020).

From Figs. 3(a)-3(e) and Tables 3-4, it is observed that DIQL is superior to IQL even without utilizing any value function factorization methods. This validates that distributional RL has beneficial influences on MARL, when it is compared to RL approaches based only on expected values.
In order to inspect the effectiveness and impacts of DFAC on learning curves, win rates, and scores, we next summarize the results of the baselines as well as their DFAC variants on the Super Hard scenarios in Figs. 3(a)-(e) and Tables 3-4. Figs. 3(a)-(e) plot the learning curves of the baselines and their DFAC variants, with the final win rates presented in Table 3, and their final scores reported in Table 4. The win rates indicate how often the player's team wins, while the scores represent how well the player's team performs. Despite the fact that SMAC's objective is to maximize the win rate, the true optimization goal of MARL algorithms is the averaged score. In fact, these two metrics are not always positively correlated (e.g., VDN and QMIX in 6h_vs_8z, and QMIX and DMIX in other maps).

It can be observed that the learning curves of DDN and DMIX grow faster and achieve higher final win rates than their corresponding baselines. In the most difficult map, 6h_vs_8z, most of the methods fail to learn an effective policy except for DDN and DMIX. The evaluation results also show that DDN and DMIX are capable of performing consistently well across all Super Hard maps with high win rates. In addition to the win rates, Table 4 further presents the final averaged scores achieved by each method, and provides deeper insights into the advantages of the DFAC framework by quantifying the performances of the learned policies of different methods.

The improvements in win rates and scores are due to the benefits offered by distributional RL (Lyle et al., 2019), which enables the distributional variants to work more effectively in MARL environments. Moreover, the evaluation results reveal that DDN performs especially well in most environments despite its simplicity.
6. Conclusion
In this paper, we provided a distributional perspective on value function factorization methods, and introduced a framework, called DFAC, for integrating distributional RL with MARL domains. We first proposed DFAC based on a mean-shape decomposition procedure to ensure that the Distributional IGM condition holds for all factorizable tasks. Then, we proposed the use of quantile mixture to implement the mean-shape decomposition in a computationally friendly manner. DFAC's ability to factorize the joint return distribution into individual utility distributions was demonstrated by a toy example. In order to validate the effectiveness of DFAC, we presented experimental results performed on all Super Hard scenarios in SMAC for a number of MARL baseline methods as well as their DFAC variants. The results show that DDN and DMIX outperform VDN and QMIX. DFAC can be extended to more value function factorization methods and offers an interesting research direction for future endeavors.
References
Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 449-458. JMLR.org, 2017.

Bellemare, M. G., Roux, N. L., Castro, P. S., and Moitra, S. Distributional reinforcement learning with linear function approximation. arXiv preprint arXiv:1902.03149, 2019.

Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pp. 1096-1105. PMLR, 2018a.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

de Witt, C. S., Foerster, J., Farquhar, G., Torr, P., Böhmer, W., and Whiteson, S. Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9924-9935, 2019.

Du, Y., Han, L., Fang, M., Liu, J., Dai, T., and Tao, D. LIIR: Learning individual intrinsic reward in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4405-4416, 2019.

Guestrin, C., Koller, D., and Parr, R. Multiagent planning with factored MDPs. In NIPS, 2001.

Karvanen, J. Estimation of quantile mixtures via L-moments and trimmed L-moments. Computational Statistics & Data Analysis, 51:947-959, 2006. doi: 10.1016/j.csda.2005.09.014.

Lin, Z., Zhao, L., Yang, D., Qin, T., Liu, T.-Y., and Yang, G. Distributional reward decomposition for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6212-6221, 2019.

Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4504-4511, 2019.

Lyu, X. and Amato, C. Likelihood quantile networks for coordinating multi-agent reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 798-806, 2020.

Matignon, L., Laurent, G., and Fort-Piat, N. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. pp. 64-69, 2007. doi: 10.1109/IROS.2007.4399095.

Mavrin, B., Zhang, S., Yao, H., Kong, L., Wu, K., and Yu, Y. Distributional reinforcement learning for efficient exploration. arXiv preprint arXiv:1905.06125, 2019.

Nikolov, N., Kirschner, J., Berkenkamp, F., and Krause, A. Information-directed exploration for deep reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byx83s09Km.

Oliehoek, F. and Amato, C. A concise introduction to decentralized POMDPs. 2016. doi: 10.1007/978-3-319-28929-8.

Oliehoek, F. A., Amato, C., et al. A Concise Introduction to Decentralized POMDPs, volume 1. Springer, 2016.

Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:2003.08839, 2020.

Rowland, M., Dadashi, R., Kumar, S., Munos, R., Bellemare, M. G., and Dabney, W. Statistics and samples in distributional reinforcement learning. arXiv preprint arXiv:1902.08102, 2019.

Samvelyan, M., Rashid, T., Schroeder de Witt, C., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2186-2188. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312-1320, 2015.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085-2087. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Tan, M. Multi-agent reinforcement learning: Independent versus cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, ICML'93, pp. 330-337, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. ISBN 1558603077.

Wang, T., Wang, J., Zheng, C., and Zhang, C. Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366, 2019.

Wang, T., Dong, H., Lesser, V., and Zhang, C. Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039, 2020.

Yang, D., Zhao, L., Lin, Z., Qin, T., Bian, J., and Liu, T.-Y. Fully parameterized quantile function for distributional reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6190-6199, 2019.

Zhang, S. and Yao, H. QUOTA: The quantile option architecture for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5797-5804, 2019.

Zhang, S. Q., Zhang, Q., and Lin, J. Efficient communication in multi-agent reinforcement learning via variance based control. In