Reward Machines for Cooperative Multi-Agent Reinforcement Learning
Cyrus Neary ∗ Zhe Xu ∗ Bo Wu ∗ Ufuk Topcu ∗ †
Abstract
In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RMs) — Mealy machines used as structured representations of reward functions — to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies and independencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We define such a notion of RM decomposition and present algorithmically verifiable conditions guaranteeing that distributed completion of the sub-tasks leads to team behavior accomplishing the original task. This framework for task decomposition provides a natural approach to decentralized learning: agents may learn to accomplish their sub-tasks while observing only their local state and abstracted representations of their teammates. We accordingly propose a decentralized q-learning algorithm. Furthermore, in the case of undiscounted rewards, we use local value functions to derive lower and upper bounds for the global value function corresponding to the team task. Experimental results in three discrete settings exemplify the effectiveness of the proposed RM decomposition approach, which converges to a successful team policy two orders of magnitude faster than a centralized learner and significantly outperforms hierarchical and independent q-learning approaches.
In multi-agent reinforcement learning (MARL), a collection of agents learn to maximize expected long-term return through interactions with each other and with a shared environment. We study MARL in a cooperative setting: all of the agents are rewarded collectively for achieving a team task. Two challenges inherent to MARL are coordination and non-stationarity. Firstly, the need for coordination between the agents arises because the correctness of any individual's actions may depend on the actions of its teammates [3, 10]. Secondly, the agents are learning and updating their behaviors simultaneously. Thus, from the point of view of any individual agent, the learning problem is non-stationary; the best solution for any individual is constantly changing [11].

A reward machine (RM) is a Mealy machine used to define tasks and behaviors dependent on abstracted descriptions of the environment [14]. Intuitively, RMs allow agents to separate tasks into stages and to learn different sets of behaviors for the different portions of the overall task. In this work, we use RMs to describe cooperative tasks and we introduce a notion of RM decomposition for the MARL problem. The proposed use of RMs explicitly encodes the information available to each agent, as well as the teammate communications necessary for successful cooperative behavior. The global (cooperative) task can then be decomposed into a collection of new RMs, each encoding a sub-task for an individual agent. We propose a decentralized learning algorithm that trains the agents individually using these sub-task RMs, effectively reducing the team's task to a collection of single-agent reinforcement learning problems. The algorithm assumes each agent may only observe its own local state and the information encoded in its sub-task RM.

Furthermore, we provide conditions guaranteeing that if each agent accomplishes its sub-task, the corresponding joint behavior provably accomplishes the original team task. Finally, decomposition of the team's task allows each agent to be trained independently of its teammates and thus addresses the problems posed by non-stationarity. Individual agents learn to condition their actions on abstractions of their teammates, eliminating the need for simultaneous learning.

Experimental results in three discrete domains exemplify the strengths of the proposed decentralized algorithm. In a two-agent rendezvous task, the proposed algorithm converges to successful team behavior a factor of 200 times faster than a centralized learner and performs roughly on par with hierarchical independent learners (h-IL) [30]. In a more complicated, temporally extended three-agent task, the proposed algorithm quickly learns effective team behavior while neither h-IL nor independent q-learning (IQL) [29] converge to policies completing the task.

∗ Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, TX.
† Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX.
Contact: {cneary, bwu3, utopcu}@utexas.edu, [email protected]. Under review.

A Markov decision process (MDP) is a tuple M = ⟨S, A, r, p, γ⟩ consisting of a finite set of states S, a finite set of actions A, a reward function r : S × A × S → ℝ, a transition probability function p : S × A → ∆(S), and a discount factor γ ∈ (0, 1]. Here ∆(S) is the set of all probability distributions over S. We denote by p(s′ | s, a) the probability of transitioning to state s′ from state s under action a. A stationary policy π : S → ∆(A) maps states to distributions over actions. The goal of reinforcement learning (RL) is to learn an optimal policy π∗ maximizing expected future discounted reward from any state [28]. The q-function for policy π is defined as the expected discounted future reward that results from taking action a from state s and following policy π thereafter. Tabular q-learning [33], an RL algorithm, uses the experience {(s_t, a_t, r_t, s_{t+1})}_{t∈ℕ} of an agent interacting with an MDP to learn the q-function q∗(s, a) corresponding to an optimal policy π∗. Given q∗(s, a), the optimal policy may be recovered.
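As a point of reference for the decentralized algorithm developed later, the following is a minimal sketch of the tabular q-learning update described above. The environment interface (`reset`, `step`) and the hyperparameter values are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, actions, episodes=1000, steps=100,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn q*(s, a) from experience tuples (s_t, a_t, r_t, s_{t+1})."""
    q = defaultdict(float)  # q[(s, a)], implicitly initialized to 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q[(s, b)])
            s_next, r, done = env.step(a)
            # one-step q-learning update toward r + gamma * max_a' q(s', a')
            target = r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
            if done:
                break
    return q  # acting greedily with respect to q recovers a policy
```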
A common framework used to extend RL to a multi-agent setting is the Markov game [21, 3]. A team Markov game of N agents is a tuple G = ⟨S_1, ..., S_N, A_1, ..., A_N, p, R, γ⟩. S_i and A_i are the finite sets of agent i's local states and actions, respectively. We define the set of joint states as S = S_1 × ... × S_N and we similarly define the set of joint actions to be A = A_1 × ... × A_N. p : S × A → ∆(S) is a joint state transition probability distribution. R : S × A × S → ℝ is the team's collective reward function, which is shared by all agents, and γ ∈ (0, 1] is a discount factor. In this work, we assume the dynamics of each agent are independently governed by local transition probability functions p_i : S_i × A_i → ∆(S_i). The joint transition function is then constructed as p(s′ | s, a) = Π_{i=1}^N p_i(s′_i | s_i, a_i), for all s, s′ ∈ S and for all a ∈ A. A team policy is defined as π : S → ∆(A). Analogous to the single-agent case, the objective of team MARL is to find a team policy π∗ that maximizes the expected discounted future reward from any joint state s ∈ S.

To introduce reward machines (RMs) and to illustrate how they may be used to encode a team's task, we consider the example shown in Figure 1a. Three agents, denoted A_1, A_2, and A_3, operate in a shared environment with the objective of allowing A_1 to reach the goal location denoted G. However, the red, yellow, and green colored areas are unsafe for agents A_1, A_2, and A_3, respectively; if one of the agents moves into its colored region before the button of the corresponding color has been pressed, the team fails the task. Furthermore, the yellow and green buttons may be pressed by an individual agent, but the red button requires two agents to simultaneously occupy the button's location before it is activated. The dashed and numbered arrows in the figure illustrate the sequence of events necessary for task completion: A_1 should push the yellow button, allowing A_2 to proceed to the green button, which is necessary for A_3 to join A_2 in pressing the red button, finally allowing A_1 to cross the red region and reach G.

Figure 1: The multi-agent buttons task. (a) Cooperative buttons domain. (b) Reward machine encoding the cooperative buttons task. In (a), the colored circles denote the locations of the buttons, the thick black areas are walls the agents cannot cross, and the numbered dotted lines show the order of high-level steps necessary to complete the task. The event set Σ of the RM in (b) consists of the button-press events for the yellow, green, and red buttons, the events of A_2 and A_3 individually pressing (or not pressing) the red button, the events of an agent entering the red, yellow, or green unsafe region, and the event of A_1 reaching the goal G.
A reward machine (RM) R = ⟨U, u_I, Σ, δ, σ⟩ consists of a finite, nonempty set U of states, an initial state u_I ∈ U, a finite set Σ of environment events, a transition function δ : U × Σ → U, and an output function σ : U × U → ℝ. Reward machines are a type of Mealy machine used to define temporally extended tasks and behaviors [14]. Figure 1b illustrates the RM encoding the buttons task.

The set Σ is the collection of all the high-level events necessary to describe the team's task. For example, Σ contains the event that A_1 has moved into the unsafe red region, as well as the event of the red button being pressed. Because both A_2 and A_3 must simultaneously press the red button for that event to occur, we additionally include in Σ the events representing A_2 or A_3 individually either pressing or not pressing the red button.

The states u ∈ U of the RM represent different stages of the team's task. Transitions between RM states are triggered by events from Σ. For example, the buttons task starts in state u_I. From this state, if one of the dangerous colored regions is entered, the corresponding event will cause a transition to a failure state from which there are no outgoing transitions; the team has failed the task, so it does not progress from this state. If instead A_1 presses the yellow button, then the corresponding event will cause the RM to transition to a new state; A_2 may now safely proceed across the yellow region. In this way, transitions in the RM represent progress through the task.

The output function σ assigns reward values to these transitions. In this work, we restrict ourselves to task completion RMs, similar to those studied in [34]; σ should only reward transitions that result in the immediate completion of the task. To formalize this idea, we define a subset F ⊆ U of reward states. If the RM is in a state belonging to F, it means the task is complete. The output function is then defined such that σ(u, u′) = 1 if u ∉ F and u′ ∈ F, and is defined to output 0 otherwise. Furthermore, there should be no outgoing transitions from states in F. In Figure 1b, F contains a single reward state. If the goal event G ∈ Σ occurs while the RM is in the state reached after all of the required button presses, the task has been successfully completed; the RM transitions to the reward state and returns reward 1.

A run of RM R on the sequence of events e_1 e_2 ... e_k ∈ Σ∗ is a sequence u_1 e_1 u_2 e_2 ... u_k e_k u_{k+1}, where u_1 = u_I and u_{t+1} = δ(u_t, e_t). If u_{k+1} ∈ F, then σ(u_k, u_{k+1}) = 1. In this case we say that the event sequence e_1...e_k completes the task described by R, and we denote this statement R(e_1...e_k) = 1. Otherwise, R(e_1...e_k) = 0. In the buttons task, for example, an event sequence in which the buttons are pressed in the required order and A_1 then reaches the goal yields R(ξ) = 1, whereas a sequence in which one of the unsafe regions is entered prematurely yields R(ξ) = 0.
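To make the task-completion semantics concrete, the following is a minimal sketch of an RM as a data structure. It is not the authors' implementation, and the state and event names in the usage example are illustrative placeholders.

```python
class RewardMachine:
    """Minimal task-completion reward machine: reward 1 is emitted only on
    transitions that enter a reward state in F; reward states are absorbing."""

    def __init__(self, states, initial_state, events, transitions, reward_states):
        self.U = set(states)            # RM states
        self.u_I = initial_state        # initial RM state
        self.Sigma = set(events)        # high-level events
        self.delta = dict(transitions)  # (u, e) -> u'
        self.F = set(reward_states)     # reward (task-complete) states

    def step(self, u, e):
        """Transition on event e; return (next state, reward from sigma)."""
        if u in self.F or (u, e) not in self.delta:
            return u, 0.0               # no outgoing transitions from F; ignore undefined events
        u_next = self.delta[(u, e)]
        reward = 1.0 if u_next in self.F else 0.0
        return u_next, reward

    def run(self, event_sequence):
        """R(xi) = 1 iff the event sequence drives the RM into a reward state."""
        u = self.u_I
        for e in event_sequence:
            u, _ = self.step(u, e)
        return 1 if u in self.F else 0


# A toy two-stage task: press a button ("b"), then reach the goal ("g").
rm = RewardMachine(states={"u0", "u1", "u2"}, initial_state="u0",
                   events={"b", "g"},
                   transitions={("u0", "b"): "u1", ("u1", "g"): "u2"},
                   reward_states={"u2"})
assert rm.run(["b", "g"]) == 1 and rm.run(["g"]) == 0
```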
RMs may be applied to the RL setting by using them to replace the reward function in an MDP. However, RMs describe tasks in terms of abstract events. To allow an RM to interface with the underlying environment, we define a labeling function L : S × U → 2^Σ, which abstracts the current environment state to sets of high-level events. Note, however, that L also takes the current RM state u ∈ U as input, allowing the events output by L to depend not only on the environment state, but also on the current progress through the task.

Q-learning with RMs (QRM) [14] is an algorithm that learns a collection of q-functions, one for each RM state u ∈ U, corresponding to the optimal policies for each stage of the task. At time step t, the agent uses the RM state u_t and its estimate of q_{u_t}(s_t, ·) to select action a_t. The environment accordingly progresses to state s_{t+1}. The events output by L(s_{t+1}, u_t) then cause the RM to transition to state u_{t+1}, and the corresponding reward output by σ is used to update q_{u_t}(s_t, a_t) (see supplementary materials §8). The QRM algorithm is guaranteed to converge to an optimal policy.

A naive approach to applying RMs in the MARL setting would be to treat the entire team as a single agent and to use QRM to learn a centralized policy. This approach quickly becomes intractable, however, due to the exponential scaling of the number of states and actions with the number of agents. Furthermore, it assumes agents communicate with a central controller at every time step, which may be undesirable from an implementation standpoint.

A decentralized approach to MARL treats the agents as individual decision-makers, and therefore requires further consideration of the information available to each agent. In this work, we assume the i-th agent can observe its own local state s_i ∈ S_i, but not the local states of its teammates. Given an RM R describing the team's task and the corresponding event set Σ, we assign the i-th agent a subset Σ_i ⊆ Σ of events. These events represent the high-level information that is available to the agent. We call Σ_i the local event set of agent i. We assume that all events represented in Σ belong to the local event set of at least one of the agents; thus ∪_{i=1}^N Σ_i = Σ.

For example, in the three-agent buttons task, the local event set Σ_1 assigned to A_1 consists of the events corresponding to the yellow button press, the red button press, A_1 entering the red region, and A_1 reaching the goal G; A_1 should avoid the red region, has access to the yellow button, must know when the red button has been pressed, and should eventually proceed to the goal location. Note that, for example, the events of entering the yellow and green regions are not included in Σ_1 because these regions are not dangerous to A_1, and because A_1 is separated from these colored regions by the walls in the environment. Similarly, the events of A_2 or A_3 individually occupying the red button are not in Σ_1 because, from A_1's perspective, they are only intermediate steps in the process of pressing the red button. The event sets Σ_2 and Σ_3 of A_2 and A_3 are defined analogously: Σ_2 contains the yellow and green button presses, A_2 entering the yellow region, A_2 individually pressing (or not pressing) the red button, and the red button press; Σ_3 contains the green button press, A_3 entering the green region, A_3 individually pressing (or not pressing) the red button, and the red button press.

Extending the definition of natural projections on automata [16] to reward machines, for each agent i we define a new RM, R_i = ⟨U_i, u_I^i, Σ_i, δ_i, σ_i, F_i⟩, called the projection of R onto Σ_i. A formal definition of R_i is provided in the supplementary materials §9. Intuitively, the projection may be constructively defined as follows:
1. Remove all transitions triggered by events not contained in Σ_i.
2. To define the projected states U_i, merge all states connected by a removed transition.
3. The remaining transitions are used to define the projected transition function δ_i.
4. The reward states F_i are defined as the collections of merged states containing at least one reward state u ∈ F from the original RM. The output function σ_i is defined accordingly.

Figures 2a, 2b, and 2c show the results of projecting the task RM from Figure 1b onto the local event sets Σ_1, Σ_2, and Σ_3, respectively. As an example, we specifically examine R_1 = ⟨U_1, u_I^1, Σ_1, δ_1, σ_1, F_1⟩, illustrated in Figure 2a. Because the events of entering the yellow or green regions, the green button press, and A_2's and A_3's individual red-button presses (and non-presses) are not elements of Σ_1, any states connected by these events are merged to form the projected states U_1. For example, the states of U which comprise the diamond structure in Figure 1b are all merged into a single projected state in U_1. Intuitively, this portion of the team's RM R encodes the necessary coordination between A_2 and A_3 to press the red button, which is irrelevant to A_1's portion of the task and is thus represented as a single state in R_1. Note that R_1 describes A_1's contribution to the team's task: press the yellow button, then wait for the red button to be pressed while avoiding the red region, before proceeding to the goal location. Intuitively, this high-level behavior is correct with respect to the team task, regardless of the behavior of A_2 or A_3.
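The following sketch illustrates steps 1–4 above: it merges RM states connected by transitions whose events fall outside Σ_i and then rebuilds the remaining transitions among the merged states. It reuses the `RewardMachine` container sketched earlier and is meant only as an illustration of the construction, not as the formal definition given in the supplementary materials; in particular, it does not enforce determinism of the projected transition function.

```python
def project_rm(rm, local_events):
    """Project RM `rm` onto the local event set `local_events` (steps 1-4)."""
    # Union-find over RM states: states connected by a removed transition are merged.
    parent = {u: u for u in rm.U}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    # Steps 1 and 2: merge states connected by transitions on events outside local_events.
    for (u, e), u_next in rm.delta.items():
        if e not in local_events:
            union(u, u_next)

    # Step 3: the remaining transitions define the projected transition function.
    delta_i = {}
    for (u, e), u_next in rm.delta.items():
        if e in local_events:
            delta_i.setdefault((find(u), e), set()).add(find(u_next))

    # Step 4: a merged state is a reward state if it contains a reward state of rm.
    U_i = {find(u) for u in rm.U}
    F_i = {find(u) for u in rm.F}
    return U_i, find(rm.u_I), delta_i, F_i
```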
Consider some finite event sequence ξ = e_1...e_k ∈ Σ∗. The projection of ξ onto Σ_i, denoted ξ_i ∈ Σ_i∗, is obtained by removing all events from ξ that are not in Σ_i (see supplementary material §9). ξ_i may be thought of as the event sequence ξ from the point of view of the i-th agent.

Figure 2: Projections of the team RM, illustrated in Figure 1b, onto the local event sets Σ_1 (panel a), Σ_2 (panel b), and Σ_3 (panel c).

Theorem 1 defines a condition guaranteeing that the composition of the individual behaviors described by the projected RMs is equivalent to the behavior described by the original team RM. This condition uses the notions of bisimilarity and parallel composition, common concepts for finite transition systems [1], that are formally defined for RMs in supplementary materials §9.
Theorem 1. Given RM R and a collection of local event sets Σ_1, Σ_2, ..., Σ_N such that ∪_{i=1}^N Σ_i = Σ, let R_1, R_2, ..., R_N be the corresponding collection of projected RMs. Suppose R is bisimilar to the parallel composition of R_1, R_2, ..., R_N. Then, given an event sequence ξ ∈ Σ∗, R(ξ) = 1 if and only if R_i(ξ_i) = 1 for all i = 1, 2, ..., N. Otherwise, R(ξ) = 0 and R_i(ξ_i) = 0 for at least one i ∈ {1, 2, ..., N}.

We note that [16, 17] present conditions, in terms of R and Σ_1, Σ_2, ..., Σ_N, which may be applied to check whether R is bisimilar to the parallel composition of its projections R_1, R_2, ..., R_N. Alternatively, one may computationally check whether this result holds by constructing the parallel composition of the projected RMs and applying the Hopcroft-Karp algorithm to check equivalence with R [13, 2].
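As an illustration of the computational check mentioned above, the sketch below composes two projected RMs in parallel: on a shared event both machines must move, while on a private event only the owning machine moves. It reuses the `RewardMachine` container sketched earlier; checking bisimilarity (or language equivalence) of the result against R is left to a standard automaton-equivalence routine such as Hopcroft-Karp and is not shown here.

```python
def parallel_compose(rm1, rm2):
    """Parallel composition of two RMs over the event set Sigma_1 | Sigma_2.

    States are pairs (u1, u2). A shared event fires only if both components
    have it defined; a private event moves only its own component."""
    events = rm1.Sigma | rm2.Sigma
    delta, states = {}, set()
    frontier = [(rm1.u_I, rm2.u_I)]
    while frontier:
        (u1, u2) = frontier.pop()
        if (u1, u2) in states:
            continue
        states.add((u1, u2))
        for e in events:
            if e in rm1.Sigma and e in rm2.Sigma:          # shared event
                if (u1, e) in rm1.delta and (u2, e) in rm2.delta:
                    nxt = (rm1.delta[(u1, e)], rm2.delta[(u2, e)])
                else:
                    continue
            elif e in rm1.Sigma:                           # private to rm1
                if (u1, e) not in rm1.delta:
                    continue
                nxt = (rm1.delta[(u1, e)], u2)
            else:                                          # private to rm2
                if (u2, e) not in rm2.delta:
                    continue
                nxt = (u1, rm2.delta[(u2, e)])
            delta[((u1, u2), e)] = nxt
            frontier.append(nxt)
    reward_states = {(u1, u2) for (u1, u2) in states
                     if u1 in rm1.F and u2 in rm2.F}
    return states, (rm1.u_I, rm2.u_I), events, delta, reward_states
```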
Inspired by Theorem 1, we propose a distributed approach to learning a decentralized policy. Our idea is to use the projected RMs to define a collection of single-agent RL tasks, and to train each agent on its respective task using the QRM algorithm described in §3.2. For clarity, we wish to train the agents using their projected RMs in an individual setting: the agents take actions in the environment in the absence of their teammates. However, the policies they learn should result in a team policy that is successful in the team setting, in which the agents interact simultaneously with the shared environment.

Projected reward machine R_i defines the task of the i-th agent in terms of high-level events from Σ_i. As discussed in §3.2, to connect R_i with the underlying environment, a labeling function is required to define the environment states that cause the events in Σ_i to occur. In the team setting, we can intuitively define a labeling function L : S × U → 2^Σ mapping team states s ∈ S and RM states u ∈ U from R to sets of events. For example, L(s, u) returns the red-button event if s is such that A_2 and A_3 are pressing the red button and u is the corresponding RM state.

To use R_i in the individual setting, however, we must first define a local labeling function L_i : S_i × U_i → 2^{Σ_i} mapping the local states s_i ∈ S_i and projected RM states u_i ∈ U_i to sets of events in Σ_i. Operating under the assumption that only one event can occur per agent per time step, we require that, for any local state pair (s_i, u_i), L_i(s_i, u_i) returns at most a single event from Σ_i. Furthermore, the local labeling functions L_1, L_2, ..., L_N should be defined such that they always collectively output the same set of events as L when being used in the team setting.

For a given event e ∈ Σ, we define the set I_e = {i | e ∈ Σ_i} as the collaborating agents on e.

Definition 2. (Decomposable labeling function) A labeling function L : S × U → 2^Σ is considered decomposable with respect to local event sets Σ_1, Σ_2, ..., Σ_N if there exists a collection of local labeling functions L_1, L_2, ..., L_N with L_i : S_i × U_i → 2^{Σ_i} such that L(s, u) outputs event e if and only if L_i(s_i, u_i) outputs event e for every i in I_e, the set of collaborating agents on event e. Here, s_i is the i-th component of the team's joint state s, and u_i ∈ U_i is the state of RM R_i containing state u ∈ U from RM R (recall that states in U_i correspond to collections of states from U).

Note that L will be decomposable if we can constructively define L_1, ..., L_N to satisfy the condition in Definition 2. Following this idea, we construct L_i from L as follows: L_i(s̄_i, u_i) outputs event e ∈ Σ_i whenever there exists a possible configuration of agent i's teammates s = (s_1, ..., s̄_i, ..., s_N) such that L(s, u) outputs e, where u ∈ U is any state belonging to u_i ∈ U_i. Our interpretation of this definition of L_i is as follows. While L(s, u) outputs the events that occur when the team is in joint state s and RM state u, the local labeling function L_i : S_i × U_i → 2^{Σ_i} outputs the events in Σ_i that could be occurring from the point of view of an agent who knows L, but may only observe part of the function's input: s_i ∈ S_i and u_i ∈ U_i. A formal definition of L_i, as well as conditions on L that ensure L_i is well defined, are given in supplementary material §10.

We say event e ∈ Σ is a shared event if it belongs to the local event sets of multiple agents, i.e., if |I_e| > 1. In the buttons task, the red-button event, which belongs to the local event sets of several agents, is an example of a shared event. Suppose A_2 and A_3 use the events output by L_2 and L_3, respectively, to update R_2 and R_3 while interacting in the team setting. Because Σ_2 and Σ_3 both include this event, the agents must synchronize on it: the shared event should simultaneously cause transitions in both projected RMs, or it should cause a transition in neither of them. In practice, synchronization on shared events is implemented as follows: if L_i returns a shared event e, the i-th agent should check with all teammates in I_e whether their local labeling functions also returned e before using the event to update R_i. Event synchronization corresponds to the agents communicating and collectively acknowledging that the shared event has occurred.
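The following is a minimal sketch of this synchronization check. The messaging interface (`broadcast`, `collect`) is an illustrative assumption; any reliable communication mechanism between the collaborating agents would serve the same purpose.

```python
def synchronize_shared_event(agent_id, event, collaborators, channel):
    """Agent `agent_id` observed `event` locally; confirm it with all
    collaborating agents in I_e before letting it transition the local RM."""
    channel.broadcast(agent_id, event)                      # announce the locally observed event
    confirmations = channel.collect(collaborators, event)   # events reported by the collaborators
    # The shared event transitions the projected RM only if every collaborator
    # in I_e also reported it at this time step; otherwise it is discarded.
    return all(confirmations.get(j) == event
               for j in collaborators if j != agent_id)
```

During decentralized training (described below), this check is replaced by a simulated synchronization signal that arrives after a random delay, so that each agent can be trained in isolation from its teammates.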
Given a sequence of joint states s_1 s_2 ... s_k in the team setting, we may use L and R to uniquely define a corresponding sequence L(s_1...s_k) ∈ Σ∗ of events. Similarly, given the corresponding collection of sequences {s_1^i ... s_k^i}_{i=1}^N of local states, the local labeling functions L_1, ..., L_N, and assuming that the agents synchronize on shared events, we may define the corresponding sequences L_i(s_1^i ... s_k^i) ∈ Σ_i∗ of events for every i = 1, 2, ..., N (see supplementary material §10).
Theorem 2. Given R, L, and Σ_1, ..., Σ_N, suppose the bisimilarity condition from Theorem 1 holds. Furthermore, assume L is decomposable with respect to Σ_1, ..., Σ_N with the corresponding local labeling functions L_1, ..., L_N. Let s_1 ... s_k be a sequence of joint environment states and {s_1^i ... s_k^i}_{i=1}^N be the corresponding sequences of local states. If the agents synchronize on shared events, then R(L(s_1 ... s_k)) = 1 if and only if R_i(L_i(s_1^i ... s_k^i)) = 1 for all i = 1, 2, ..., N. Otherwise, R(L(s_1 ... s_k)) = 0 and R_i(L_i(s_1^i ... s_k^i)) = 0 for at least one i ∈ {1, 2, ..., N}.

Let V^π(s) denote the expected future undiscounted reward returned by R, given the team follows joint policy π from joint environment state s ∈ S and initial RM state u_I. Similarly, let V_i^π(s) denote the expected future reward returned by R_i, given the team follows the same policy from state s and projected initial RM state u_I^i. Using the Fréchet conjunction inequality, we provide the following upper and lower bounds on the value function corresponding to R, in terms of the value functions corresponding to the projected RMs R_i.
Theorem 3. Suppose the conditions in Theorem 2 are satisfied. Then

max{0, V_1^π(s) + V_2^π(s) + ... + V_N^π(s) − (N − 1)} ≤ V^π(s) ≤ min{V_1^π(s), V_2^π(s), ..., V_N^π(s)}.

Theorem 2 tells us that it makes no difference whether we use RM R and team labeling function L, or projected RMs R_1, ..., R_N and local labeling functions L_1, ..., L_N, to describe the team task. By replacing R and L with R_1, ..., R_N and L_1, ..., L_N, however, we note that the only interactions each agent has with its teammates are synchronizations on shared events.

This key insight provides a method to train the agents separately from their teammates. We train each agent in an individual setting, isolated from its teammates, using rewards returned from R_i and events returned from L_i. Whenever L_i outputs what would normally be a shared event in the team setting, we simulate a synchronization signal after some random amount of time.

During training, each agent individually performs q-learning to find an optimal policy for the sub-task described by its projected RM, similarly to as described in §3.2. The i-th agent learns a collection of q-functions Q_i = {q_{u_i} | u_i ∈ U_i} such that each q-function q_{u_i} : S_i × A_i → ℝ corresponds to the agent's optimal policy while it is in projected RM state u_i.

To evaluate the learned policies, we test the team by allowing the agents to interact in the team environment and evaluate the team's performance using the team task RM R. Each agent tracks its own task progress using its projected RM R_i and follows the policy it learned during training. For further details, see supplementary material §12.
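The following sketch illustrates the decentralized training loop described above: each agent runs QRM on its own projected RM and local labeling function, in isolation from its teammates, with shared events resolved by a simulated synchronization signal that arrives with some probability at each step. The environment interface, the synchronization probability, and the helper names are illustrative assumptions rather than the authors' exact implementation.

```python
import random
from collections import defaultdict

def train_agent_dqprm(env_i, actions_i, rm_i, label_i, shared_events,
                      episodes=1000, steps=1000, alpha=0.1, gamma=0.9,
                      epsilon=0.1, p_sync=0.3):
    """Train one agent on its projected RM rm_i with local labeling function label_i.

    Assumed interfaces: rm_i.step(u, e) -> (u_next, reward);
    label_i(s_next, u) -> an event in Sigma_i or None."""
    q = {u: defaultdict(float) for u in rm_i.U}   # one q-function per projected RM state
    for _ in range(episodes):
        s, u = env_i.reset(), rm_i.u_I
        for _ in range(steps):
            # epsilon-greedy action selection using the q-function for RM state u
            if random.random() < epsilon:
                a = random.choice(actions_i)
            else:
                a = max(actions_i, key=lambda b: q[u][(s, b)])
            s_next = env_i.step(s, a)
            e = label_i(s_next, u)
            # A shared event takes effect only when the simulated synchronization
            # signal arrives (probability p_sync per step), mimicking teammates.
            if e in shared_events and random.random() >= p_sync:
                e = None
            if e is None:
                u_next, r = u, 0.0
            else:
                u_next, r = rm_i.step(u, e)
            target = r + gamma * max(q[u_next][(s_next, b)] for b in actions_i)
            q[u][(s, a)] += alpha * (target - q[u][(s, a)])
            s, u = s_next, u_next
            if u in rm_i.F:
                break
    return q
```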
In this section, we provide empirical evaluations of DQPRM in three task domains. The buttons task is as described in §3. We additionally consider two-agent and three-agent rendezvous tasks in which each agent must simultaneously occupy a specific rendezvous location before individually navigating to separate goal locations.

We compare DQPRM's performance against three baseline algorithms: the centralized QRM (CQRM) algorithm described in §3.2, independent q-learners (IQL) [29], and hierarchical independent learners (h-IL) [30]. Because of the non-Markovian nature of the tasks, we provide both the IQL and h-IL agents with additional memory states. In the buttons task, the memory states encode which buttons have already been pressed. In the rendezvous task, the memory state encodes whether or not the team has successfully completed the rendezvous.

Each IQL agent learns a q-function mapping augmented state-action pairs to values. That is, the i-th agent learns a q-function q_i : S_i × S_M × A_i → ℝ, where S_i, A_i are the local states and actions of the agent and S_M is the finite set of memory states, commonly observed by the entire team.

Our implementation of h-IL is inspired by the learning structure outlined in [30]. Each agent uses tabular q-learning to learn a meta-policy — which uses the current memory state to select a high-level option — as well as a collection of low-level policies — which implement those options in the environment. The available options correspond to the high-level tasks available to each agent. For example, A_1 in the buttons task is provided with the following three options: remain in a safe (non-colored) region, navigate to the yellow button, and navigate to the goal location.

In all algorithms, we use a fixed discount factor γ, a fixed learning rate α, and an exploration parameter ε that is annealed to zero as training progresses. In the training of the DQPRM agents, if an agent observes a shared event, then with a fixed probability it is provided with a signal simulating successful synchronization with all collaborators.

All tasks are implemented in a 10x10 gridworld and all agents have the following 5 available actions: move right, move left, move up, move down, or don't move. If an agent moves in any direction, then with some probability the agent will instead slip to an adjacent state. Each training episode lasts 1,000 time steps. We perform periodic testing episodes in which the agents exploit the policies they have learned and the team's performance is recorded.

Figure 3 shows the experimental results for each algorithm over 10 separate runs per domain. The figures plot the median number of steps required to complete the team task against the number of elapsed training steps. The shaded regions enclose the lower and upper percentiles across runs.

Figure 3: Algorithm performance on the two-agent rendezvous task (a), the three-agent rendezvous task (b), and the buttons task (c). The y-axis shows the number of steps required for the learned policies to complete the task. The x-axis shows the number of elapsed training steps. Note that the scale in (a) is logarithmic, whereas it is linear in (b) and (c).

We note that the CQRM baseline is only tested in the two-agent scenario because this centralized approach requires excessive amounts of memory for the three-agent tasks; storing a centralized policy for three agents requires approximately two billion separate values.

While the h-IL baseline marginally outperforms the proposed DQPRM algorithm in the two-agent rendezvous task, in the more complex tasks involving three agents, DQPRM significantly outperforms all baseline methods. The key advantage of the DQPRM algorithm, and what allows it to scale well with task complexity and the number of agents, is that it trains the agents separately using their respective single-agent task projections. This removes the problem of non-stationarity and also allows the agents to more frequently receive reward during training. This is especially beneficial to the types of tasks we study, which have sparse and delayed feedback. We further discuss the differences between DQPRM and hierarchical approaches to multi-agent learning in §6.

Task decomposition in multi-agent systems has been studied from a planning and cooperative control perspective [16, 17, 18, 7]. These works examine the conditions in which group tasks described by automata may be broken into sub-tasks executable by individuals. [8, 6] provide methods to synthesize control policies for large-scale multi-agent systems with temporal logic specifications. However, all of these works assume a known model of the environment, differing from the learning setting of this paper.

The MARL literature is rich [35, 11, 12, 23]. A popular approach to MARL is IQL [29], in which each agent learns independently and treats its teammates as part of the environment. [10, 19, 31, 27, 25, 26] decompose cooperative team tasks by factoring the joint q-function into components. Our work, which examines cooperative tasks that have sparse and temporally delayed rewards, is most closely related to hierarchical approaches to MARL. In particular, [22, 9] use task hierarchies to decompose the multi-agent problem. By learning cooperative strategies only in terms of the sub-tasks at the highest levels of the hierarchy, agents learn to coordinate much more efficiently than if they were sharing information at the level of primitive state-action pairs.
More recently, [30] empirically demonstrates the effectiveness of a deep hierarchical approach to certain cooperative MARL tasks. A key difference between task hierarchies and RMs is that RMs explicitly encode the temporal ordering of the high-level sub-tasks. It is by taking advantage of this information that we are able to break a team's task into components, and to train the agents independently while guaranteeing that they are learning behavior appropriate for the original problem. Conversely, in a hierarchical approach, the agents must still learn to coordinate at the level of sub-tasks. Thus, the learning problem remains inherently multi-agent, albeit simplified.

In this work, we assume the task RM is known, and present a method to use its decomposition to efficiently solve the MARL problem. The authors of [34, 15] demonstrate that, in the single-agent RL setting, RMs can instead be learned from experience, removing the assumption of the RM being known a priori by the learner. This presents an interesting direction for future research: how may agents learn, in a multi-agent setting, RMs encoding either the team's task or the projected local tasks? Furthermore, [14, 4] demonstrate in the single-agent RL setting that RMs may be applied to continuous environments by replacing tabular q-learning with double deep q-networks [32]. We note that this extension to more complex environments also readily applies to our work, which decomposes multi-agent problems into collections of RMs describing single-agent tasks.
In this work, we propose a reward machine (RM) based task representation for cooperative multi-agent reinforcement learning (MARL). The representation allows a team's task to be decomposed into sub-tasks for individual agents. We accordingly propose a decentralized q-learning algorithm that effectively reduces the MARL problem to a collection of single-agent problems. Experimental results demonstrate the efficiency and scalability of the proposed algorithm, which learns successful team policies for temporally extended cooperative tasks that cannot be solved by the baseline algorithms used for comparison. This work demonstrates how well-suited RMs are to abstraction and decomposition of MARL problems, and opens interesting directions for future research.

References

[1] Christel Baier and Joost-Pieter Katoen.
Principles of model checking . MIT press, 2008.[2] Filippo Bonchi and Damien Pous. Checking nfa equivalence with bisimulations up to congru-ence.
ACM SIGPLAN Notices , 48(1):457–468, 2013.[3] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In
Proceedings of the 6th conference on Theoretical aspects of rationality and knowledge , pages195–210. Morgan Kaufmann Publishers Inc., 1996.[4] Alberto Camacho, R Toro Icarte, Toryn Q Klassen, Richard Valenzano, and Sheila A McIlraith.Ltl and beyond: Formal languages for reward function specification in reinforcement learning.In
Proceedings of the 28th International Joint Conference on Artificial Intelligence , pages6065–6073, 2019.[5] Christos G Cassandras and Stephane Lafortune.
Introduction to discrete event systems . SpringerScience & Business Media, 2009.[6] Murat Cubuktepe, Zhe Xu, and Ufuk Topcu. Policy synthesis for factored mdps with graph tem-poral logic specifications. In
Proceedings of the 19th International Conference on AutonomousAgents and MultiAgent Systems , pages 267–275, 2020.[7] Jin Dai and Hai Lin. Automatic synthesis of cooperative multi-agent systems. In , pages 6173–6178. IEEE, 2014.[8] Franck Djeumou, Zhe Xu, and Ufuk Topcu. Probabilistic swarm guidance with graph temporallogic specifications. In
Proceedings of Robotics: Science and Systems XVI , 2020.[9] Mohammad Ghavamzadeh, Sridhar Mahadevan, and Rajbala Makar. Hierarchical multi-agentreinforcement learning.
Autonomous Agents and Multi-Agent Systems , 13(2):197–229, 2006.[10] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In
International Conference on Machine Learning , volume 2, pages 227–234. Citeseer, 2002.[11] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. Asurvey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprintarXiv:1707.09183 , 2017.[12] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. A survey and critique of multiagentdeep reinforcement learning.
Autonomous Agents and Multi-Agent Systems , 33(6):750–797,2019.[13] John E Hopcroft.
A linear algorithm for testing equivalence of finite automata , volume 114.Defense Technical Information Center, 1971.[14] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using rewardmachines for high-level task specification and decomposition in reinforcement learning. In
International Conference on Machine Learning , pages 2112–2121, 2018.[15] Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, andSheila McIlraith. Learning reward machines for partially observable reinforcement learning. In
Advances in Neural Information Processing Systems , pages 15497–15508, 2019.[16] Mohammad Karimadini and Hai Lin. Guaranteed global performance through local coordina-tions.
Automatica , 47(5):890–898, 2011.[17] Mohammad Karimadini, Hai Lin, and Ali Karimoddini. Cooperative tasking for deterministicspecification automata.
Asian Journal of Control , 18(6):2078–2087, 2016.[18] Ali Karimoddini, Mohammad Karimadini, and Hai Lin. Decentralized hybrid formation controlof unmanned aerial vehicles. In
American Control Conference , pages 3887–3892. IEEE, 2014.[19] Jelle R Kok and Nikos Vlassis. Using the max-plus algorithm for multiagent decision makingin coordination graphs. In
Robot Soccer World Cup , pages 1–12. Springer, 2005.[20] Ratnesh Kumar and Vijay K Garg.
Modeling and control of logical discrete event systems ,volume 300. Springer Science & Business Media, 2012.[21] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine learning proceedings 1994 , pages 157–163. Elsevier, 1994.922] Rajbala Makar, Sridhar Mahadevan, and Mohammad Ghavamzadeh. Hierarchical multi-agentreinforcement learning. In
Proceedings of the 5th International Conference on AutonomousAgents , pages 246–253, 2001.[23] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforce-ment learners in cooperative markov games: a survey regarding coordination problems.
TheKnowledge Engineering Review , 27(1):1–31, 2012.[24] Rémi Morin. Decompositions of asynchronous systems. In
International Conference onConcurrency Theory , pages 549–564. Springer, 1998.[25] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, JakobFoerster, and Shimon Whiteson. Qmix: monotonic value function factorisation for deepmulti-agent reinforcement learning. arXiv preprint arXiv:1803.11485 , 2018.[26] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran:Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408 , 2019.[27] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi,Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In
Pro-ceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems ,pages 2085–2087. International Foundation for Autonomous Agents and Multiagent Systems,2018.[28] Richard S Sutton and Andrew G Barto.
Reinforcement learning: An introduction . MIT press,2018.[29] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In
Proceed-ings of the 10th International Conference on Machine Learning , pages 330–337, 1993.[30] Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia,Chunxu Ren, Yan Zheng, Changjie Fan, and Li Wang. Hierarchical deep multiagent reinforce-ment learning. arXiv preprint arXiv:1809.09332 , 2018.[31] Elise Van der Pol and Frans A Oliehoek. Coordinated deep reinforcement learners for trafficlight control.
Proceedings of Learning, Inference and Control of Multi-Agent Systems , 2016.[32] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with doubleq-learning. In
Proceedings of the 13th AAAI Conference on Artificial Intelligence , 2016.[33] Christopher JCH Watkins and Peter Dayan. Q-learning.
Machine learning , 8(3-4):279–292,1992.[34] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu.Joint inference of reward machines and policies for reinforcement learning. In
Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 590–598, 2020.

[35] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.

Reward Machines for Cooperative Multi-Agent Reinforcement Learning: Supplementary Material
Algorithm 1 shows pseudocode for the QRM algorithm, originally introduced in [14]. The algorithm learns a collection Q = {q_u : S × A → ℝ | u ∈ U} of q-functions, each corresponding to a separate policy to follow when in RM state u ∈ U. At each time step t, the agent takes action a_t and the MDP progresses to the next state s_{t+1}. The agent uses its current RM state to abstract its new environment state through the labeling function, obtaining L(s_{t+1}, u_t) ⊆ Σ. Each event in L(s_{t+1}, u_t) causes an RM transition and a resulting reward output. Note that for a single environment step, the reward machine may transition any number of times, corresponding to the number of events that occurred simultaneously at time t.

Algorithm 1:
Q-Learning with Reward Machines
Input: R = ⟨U, u_I, Σ, δ, σ, F⟩, L, γ, α
Output: Q = {q_u : S × A → ℝ | u ∈ U}

Q ← InitializeQFunctions()
for n = 1 to NumEpisodes do
    u ← u_I, s ← environmentInitialState()
    for t = 0 to NumSteps do
        a ← getAction(q_u, s)
        s′ ← executeAction(s, a)
        u_temp ← u, u′ ← u, r ← 0
        for e ∈ L(s′, u) do
            u′ ← δ(u_temp, e)
            r ← r + σ(u_temp, u′)
            u_temp ← u′
        q_u(s, a) ← (1 − α) q_u(s, a) + α (r + γ max_{a′∈A} q_{u′}(s′, a′))
        u ← u′, s ← s′
        if u ∈ F then break
return Q

Let Σ∗ be the set of all finite strings, including the empty string ε, over elements from Σ. Any string ξ = e_1 e_2 ... e_k ∈ Σ∗ thus represents a sequence of environment events. Following the notation of [20], we extend the domain of definition of the transition function δ to U × Σ∗.

Definition 3. (RM transition on a sequence of events) For a reward machine R and sequence of events ξ ∈ Σ∗, we define the transition over sequence ξ from state u recursively via the relations u = δ(u, ε) and δ(u, ξe) = δ(δ(u, ξ), e), ∀ξ ∈ Σ∗, e ∈ Σ.

In words, δ(u, ξ) ∈ U is the RM state that the sequence of events ξ will cause the RM to transition to, given it begins in state u. Given our definition of the output function σ for a task RM, we note that the RM will return 1 if and only if ξ causes the RM to transition to a final state in F; i.e., R(ξ) = 1 if and only if u_I ∉ F and δ(u_I, ξ) ∈ F, and otherwise R(ξ) = 0.

To define the projection of an RM R onto a local event set Σ_i ⊆ Σ, we begin by defining a notion of state equivalence under local event set Σ_i [24].
It’s possible for δ i to map somestate-event pairs to sets of states instead of to individual states. However, [16] show that R is bisimilarto the parallel composition of its projections R , ... R N (the condition in Theorem 1) if and only iffor every i = 1 , , ..., N , R i is bisimilar to an RM that does have a deterministic transition function.So, in this work we assume that R i is always constructed such that δ i is deterministic. Practically,this may be achieved by merging any states u i and u i if there exists a state u i and an event e ∈ Σ i such that δ i ( u i , e ) = { u i , u i } .Now we formally define the projection of a sequence of events ξ ∈ Σ ∗ onto a local event set Σ i ⊆ Σ . Definition 6. (Projection of string of events onto local event subset) Given event set Σ and a localevent set Σ i ⊆ Σ , the projection P Σ i : Σ ∗ → Σ ∗ i is defined inductively by the relations P Σ i ( ε ) = ε and ∀ ξ ∈ Σ ∗ , ∀ e ∈ Σ : P Σ i ( ξe ) = (cid:26) P Σ i ( ξ ) e if e ∈ Σ i P Σ i ( ξ ) Otherwise
In definition 6 recall that ε denotes the empty sequence. (Parallel composition of RMs) The parallel composition of two reward machines R i = (cid:104) U i , u iI , Σ i , δ i , σ i , F i (cid:105) , i = 1 , is defined as R (cid:107) R = (cid:104) U, u I , Σ , δ, σ, F (cid:105) where • The set of states is defined as U = U × U . • The initial state is u I = ( u I , u I ) . • The set of events is defined as
Σ = Σ ∪ Σ . • δ is defined such that for every ( u , u ) ∈ U × U and for every event e ∈ Σ , δ (( u , u ) , e ) = ( δ ( u , e ) , δ ( u , e )) , if δ ( u , e ) , δ ( u , e ) defined , e ∈ Σ ∩ Σ ( δ ( u , e ) , u ) , if δ ( u , e ) defined , e ∈ Σ \ Σ ( u , δ ( u , e )) , if δ ( u , e ) defined , e ∈ Σ \ Σ undefined , otherwise • The set of final states is defined as F = F × F • The output function σ : U × U → R is defined such that σ ( u, u (cid:48) ) = 1 if u / ∈ F , u ∈ F and σ = 0 otherwise.By (cid:107) Ni =1 R i we denote the parallel composition of a collection of RMs R , R , ..., R N . Theparallel composition of more than two RMs is defined using the associative property of the parallelcomposition operator (cid:107) Ni =1 R i = R (cid:107) ( R (cid:107) ( ... (cid:107) ( R N − (cid:107) R N ))) [5]. efinition 8. (Bisimilarity of Reward Machines) Let R i = (cid:104) U i , u iI , Σ , δ i , σ i , F i (cid:105) , i = 1 , , be twoRMs. R and R are bisimilar, denoted R ∼ = R if there exists a relation R ⊆ U × U with respectto common event set Σ such that1. ( u I , u I ) ∈ R .2. For every ( u , u ) ∈ R , • u ∈ F if and only if u ∈ F • if u (cid:48) ∈ δ ( u , e ) for some e ∈ Σ , then there exists u (cid:48) ∈ U such that u (cid:48) ∈ δ ( u , e ) and ( u (cid:48) , u (cid:48) ) ∈ R . • if u (cid:48) ∈ δ ( u , e ) for some e ∈ Σ , then there exists u (cid:48) ∈ U such that u (cid:48) ∈ δ ( u , e ) and ( u (cid:48) , u (cid:48) ) ∈ R .
10 Local Labeling Functions (Consistent set of team states) Given agent i ’s local state s i ∈ S i and the state u i ∈ U i of its projected RM, we define the set B s i ,u i ⊆ S × U as B s i ,u i = { ( s , u ) | s ∈ S × S × ... × { s i } × ... × S N , u ∈ u i } For clarity, recall that any state u i in the projected RM R i corresponds to a collection of states from U . Thus any pair ( s , u ) ∈ B s i ,u i corresponds to a team environment state s ∈ S such that agent i isin local state s i and to a team RM state u ∈ U belonging to the collection u i ∈ U i . In words, B s i ,u i is the set of all pairs of team environment states and team RM states that are consistent with the localenvironment state s i and projected RM state u i . Definition 10. (Local labeling function) Given a team labeling function L and a local event set Σ i ,we define the local labeling function L i , for all ( s i , u i ) ∈ S i × U i , as L i ( s i , u i ) = (cid:26) L ( s , u ) ∩ Σ i , if ∃ ( s , u ) ∈ B s i ,u i such that L ( s , u ) ∩ Σ i (cid:54) = ∅∅ , if ∀ ( s , u ) ∈ B s i ,u i , L ( s , u ) ∩ Σ i = ∅ . We define the following three conditions to ensure that L i : S i × U i → Σ i is well defined, and thatthe collection L ,..., L N can be used in place of L . These conditions must hold for i = 1 , , ..., N .1. To ensure L i maps to singleton sets of local events, or to the empty set, we must have thatfor any ( s , u ) ∈ S × U , | L ( s , u ) ∩ Σ i | ≤ .2. Given any input ( s i , u i ) , to ensure that L i ( s i , u i ) has a unique output, there must exist aunique e ∈ Σ i such that L ( s , u ) ∩ Σ i = { e } or L ( s , u ) ∩ Σ i = ∅ for every ( s , u ) ∈ B s i ,u i .3. For every event e ∈ Σ , if e / ∈ L ( s , u ) then there must exists some local event set Σ i containing e such that L (˜ s , ˜ u ) ∩ Σ i = ∅ for every (˜ s , ˜ u ) ∈ B s i ,u i . Here s i is the local stateof agent i consistent with team state s , and u i is the state of R i containing u .The first condition ensures that L only returns a single event from the local event set of each agent forany step of the environment. The second condition ensures that given the local state pair ( s i , u i ) ofagent i , the same event singleton { e } ∈ Σ i is returned by L i regardless of the states of the teammatesof agent i . The final condition ensures that e is returned by L only if it is also returned by L i forevery i ∈ I e . For any finite trajectory s s ... s k of team environment states, we may use the reward machine R andlabeling function L to define a sequence of triplets ( s , u , l )( s , u , l ) ... ( s k , u k , l k ) and a stringof events L ( s ... s k ) ∈ Σ ∗ . Here u t is the state of team RM R at time t and l t is the output of thelabeling function L at time t . Algorithm 2 details the constructive definition of the sequence.Similarly, for a collection of local environment states { s i s i ...s ik } Ni =1 and the collections of pro-jected reward machines {R i } Ni =1 and local labeling functions { L i } Ni =1 we may define the collectionof sequences { ( s i , u i , ˜ l i )( s i , u i , ˜ l i ) ... ( s ik , u ik , ˜ l ik ) } Ni =1 as well as a collection of strings of events13 lgorithm 2: Construct sequence of RM states and labeling function outputs.
Input: s s ... s k , R , L Output: ( s , u , l )( s , u , l ) ... ( s k , u k , l k ) , L ( s s ... s k ) u ← u I , l ← ∅ , ξ ← emptyString () for t = 1 to k − do u temp ← u t for e ∈ l t do u temp ← δ ( u temp , e ) ξ ← append ( ξ, e ) u t +1 ← u temp l t +1 ← L ( s t +1 , u t ) L ( s s ... s k ) ← ξ return ( s , u , l )( s , u , l ) ... ( s k , u k , l k ) , L ( s s ... s k ) { L i ( s i ...s ik ) } Ni =1 . Here, u it is the state of projected RM R i and ˜ l it is the output of local labelingfunction L i at time t , after synchronization with collaborating teammates on shared events. Algorithm3 details the constructive definition of the sequence. Algorithm 3:
Construct sequence of projected RM states and synchronized labeling function outputs.
Input: { s i s i ...s ik } Ni =1 , {R i } Ni =1 , { L i } Ni =1 Output: { ( s i , u i , ˜ l i )( s i , u i , ˜ l ) ... ( s ik , u ik , ˜ l ik ) } Ni =1 , { L i ( s i s i ...s ik )) } Ni =1 for i = 1 to N do u i ← u iI , ˜ l i ← ∅ , ξ i ← emptyString () for t = 1 to k − do for i = 1 to N do u itemp ← u it for e ∈ ˜ l it do u itemp ← δ i ( u itemp , e ) ξ i ← append ( ξ i , e ) u it +1 ← u itemp l it +1 ← L i ( s it +1 , u it ) for i = 1 to N do I e ← getCollaboratingAgents ( l it +1 ) ˜ l it +1 ← (cid:84) j ∈ I e l jt +1 , (Synchronization step) for i = 1 to N do L i ( s i s i ...s ik )) ← ξ i return { ( s i , u i , ˜ l i )( s i , u i , ˜ l ) ... ( s ik , u ik , ˜ l ik ) } Ni =1 , { L i ( s i s i ...s ik )) } Ni =1 We note that the sequences l l ...l k ∈ (2 Σ ) ∗ and ˜ l i ˜ l i ... ˜ l ik ∈ (2 Σ i ) ∗ of labeling function outputs aresequences of sets of events. However, given our assumption that L i ( s it +1 , u it ) outputs at most oneevent per time step, | ˜ l it | ≤ for every t . This assumption corresponds to the idea that only one eventmay occur to an individual agent per time step.However, the team labeling function L may return multiple events at a given time step, correspondingto all the events that occur concurrently to separate agents. Because R is assumed to be equivalent tothe parallel composition of a collection of component RMs, all such concurrent events are interleaved[1]; the order in which they cause transitions in R doesn’t matter. So, Given l l ...l k ∈ (2 Σ ) ∗ , thecorresponding sequence of events L ( s s ... s k ) ∈ Σ ∗ is constructed by iteratively appending elementsin l t to L ( s s ... s k ) , as detailed in Algorithm 3. Similarly, from sequence ˜ l i ˜ l i ... ˜ l ik ∈ (2 Σ i ) ∗ weconstruct the sequence of local events L i ( s i s i ...s ik ) ∈ Σ ∗ i .14 Theorem 1.
Given RM R and a collection of local event sets Σ , Σ , ..., Σ N such that (cid:83) Ni =1 Σ i = Σ ,let R , R , ... R N be the corresponding collection of projected RMs. Suppose R is bisimilar to theparallel composition of R , R , ... R N . Then given an event sequence ξ ∈ Σ ∗ , R ( ξ ) = 1 if and onlyif R i ( ξ i ) = 1 for all i = 1 , , ..., N . Otherwise, R ( ξ ) = 0 and R i ( ξ i ) = 0 for all i = 1 , , ...N .Proof. For clarity, recall that ξ i ∈ Σ ∗ i denotes the projection P Σ i ( ξ ) .Let R p = (cid:104) U p , u pI , Σ , δ p , σ p , F p (cid:105) = (cid:107) Ni =1 R i be the parallel composition of R , ..., R N . Note that R ( ξ ) = 1 if and only if δ ( u I , ξ ) ∈ F . Similarly, R i ( ξ i ) = 1 if and only if δ i ( u iI , ξ i ) ∈ F i for every i = 1 , ..., N . So, it is sufficient to show that δ ( u I , ξ ) ∈ F if and only if δ i ( u iI , ξ i ) ∈ F i for every i = 1 , ..., N . Given the assumption that R ∼ = R p , we can show by induction that δ ( u I , ξ ) ∈ F if andonly if δ p ( u pI , ξ ) ∈ F p (see chapter 7 of [1]). Now, given the definition of parallel composition, it isreadily seen that δ p ( u pI , ξ ) ∈ F p if and only if δ i ( u iI , ξ i ) ∈ F i for every i = 1 , ..., N [20], concludingthe proof. Theorem 2.
Given R , L , and Σ , ..., Σ N , suppose the bisimilarity condition from Theorem 1holds. Furthermore, assume L is decomposable with respect to Σ , ... Σ N with the correspondinglocal labeling functions L , ..., L N . Let s ... s k be a sequence of joint environment states and { s i ...s ik } Ni =1 be the corresponding sequences of local states. If the agents synchronize on sharedevents, then R ( L ( s ... s k )) = 1 if and only if R i ( L i ( s i ...s ik )) = 1 for all i = 1 , , ..., N . Otherwise R ( L ( s ... s k )) = 0 and R i ( L i ( s i ...s ik )) = 0 for all i = 1 , , ..., N .Proof. Note that given the result of Theorem 1, it is sufficient to show that for every i = 1 , , ..., N ,the relationship P Σ i ( L ( s s ... s k )) = L i ( s i s i ...s ik ) holds. Recall that L ( s s ... s k ) ∈ Σ ∗ denotesthe sequence of events resulting from team trajectory s ... s k , as described in §10. P Σ i ( L ( s s ... s k )) denotes the projection of this sequence onto local event set Σ i , as defined in §9. Finally, L i ( s i s i ...s ik ) ∈ Σ ∗ i denotes the sequence of events output by local labeling function L i , assumingthe agent synchronizes with its teammates on shared events.In words, we wish to show that for every i , the projection of the sequence output by labeling function L is equivalent to the sequence of synchronized outputs of local labeling function L i . To do this, weuse induction to show that the appropriate notion of equivalence holds at each time step t = 1 , ..., k .We write this statement as l t ∩ Σ i = ˜ l it for every i = 1 , ..., N and every t = 1 , ..., k . Here, l t is theoutput of labeling function L at time t and ˜ l it is the synchronized output of local labeling function L i (§10.2).At time t = 0 , l and ˜ l i are defined to be empty sets: no events have yet occurred (§10.2). So, trivially l ∩ Σ i = ˜ l i . Furthermore, u I ∈ u iI by definition of the projected initial state u iI . Recall that u I ∈ U is the initial state of RM R , and u iI ∈ U i is the initial state of projected RM R i .Now suppose that at some arbitrary time t , l t ∩ Σ i = ˜ l it and u t ∈ u it for every i = 1 , ..., N . We wishto show that this implies l t +1 ∩ Σ i = ˜ l it +1 and u t +1 ∈ u it +1 .Showing l t +1 ∩ Σ i = ˜ l it +1 : Recall our assumption that for any pair ( s it +1 , u it ) , L i ( s it +1 , u it ) outputsonly one event, corresponding to the idea that only one event may occur to an individual agent pertime step. Thus for any i = 1 , ..., N , l t +1 ∩ Σ i = L ( s t +1 , u t ) ∩ Σ i is either equal to { e } for some e ∈ Σ i or it is equal to the empty set. • If L ( s t +1 , u t ) ∩ Σ i = { e } , then L j ( s jt +1 , u jt ) = { e } for every j ∈ I e by definition of L being decomposable with corresponding local labeling functions L , L , ..., L N . Thus ˜ l it +1 = (cid:84) j ∈ I e L j ( s jt +1 , u jt ) = { e } . Recall that I e = { i | e ∈ Σ i } . • Suppose instead that L ( s t +1 , u t ) ∩ Σ i = ∅ . – If L i ( s it +1 , u it ) = ∅ , then clearly ˜ l it +1 = ∅ . – If L i ( s it +1 , u it ) = { e } for some e ∈ Σ i , then there exists some j ∈ I e such that e / ∈ L j ( s jt +1 , u jt ) by the definition of L being decomposable. Thus ˜ l it +1 = ∅ .15howing u t +1 ∈ u it : We begin by using the knowledge that l t ∩ Σ i = ˜ l it +1 , and we again proceed byconsidering the two possible cases. 
• If l t ∩ Σ i = ˜ l it = ∅ , then the projected RM R i will not undergo a transition and so u it +1 = u it .We know that u t ∈ u it , and that RM R doesn’t undergo any transition triggered by an eventin Σ i . So, by definition the projected states of R i , we have u t +1 ∈ u it . Thus, u t +1 ∈ u it +1 . • Now consider the case l t ∩ Σ i = ˜ l it = { e } . Projected RM R i will transition to a newstate u it +1 according to δ i ( u it , e ) . Assume, without loss of generality, that l t contains eventsoutside of Σ i which trigger transitions both before and after the transition triggered by e .That is, suppose a, b ∈ l t \ Σ i and that R undergoes the following sequence of transitions: u (cid:48) = δ ( u t , a ) , ˜ u = δ ( u (cid:48) , e ) , and finally u t +1 = δ (˜ u, b ) . Because u t ∈ u it and a / ∈ Σ i ,we know u (cid:48) ∈ u it . Because e ∈ Σ i , ˜ u ∈ ˜ u i for some ˜ u i ∈ U i not necessarily equal to u it .Finally, because b / ∈ Σ i , we have u t +1 ∈ ˜ u i . So, u t +1 ∈ ˜ u i for some projected state ˜ u i such that there exist states u (cid:48) ∈ u it and ˜ u ∈ ˜ u i such that ˜ u = δ ( u (cid:48) , e ) where e ∈ Σ i . By ourdefinition of δ i and our enforcement of it being a deterministic transition function (§9), u it +1 is the unique such state in U i , which implies ˜ u i = u it +1 . Thus u t +1 ∈ u it +1 .By induction we conclude the proof. Theorem 3.
Suppose the conditions in Theorem 2 are satisfied, then max { , V π ( s ) + V π ( s ) + ... + V π N ( s ) − ( N − } ≤ V π ( s ) ≤ min { V π ( s ) , V π ( s ) , ..., V π N ( s ) } . Proof.
The Fréchet conjunction inequality states that if E , E , ..., E N are a collection of eventswith probabilities P ( E ) , P ( E ) , ..., P ( E N ) , then: max { , P ( E ) + P ( E ) + ... + P ( E N ) − ( N − } ≤ P ( E ∧ E ∧ ... ∧ E N ) ≤ min { P ( E ) , P ( E ) , ..., P ( E N ) } . Recall that V π ( s ) denotes the expected future undiscounted reward returned by R , given the teamfollows joint policy π from joint environment state s ∈ S . We begin by noting that V π ( s ) isequivalent to the probability that the task encoded by R is eventually completed, given the teamfollows policy π from the initial state s and R begins in state u I . To see this, we note that givenan initial team state s ∈ S , any (possibly history dependent) team policy induces a probabilitydistribution over the set of all possible trajectories s s ... of team states. For any such trajectory, thecorresponding sum of undiscounted rewards returned by R will be if the trajectory results in thecompletion of the task described by R , and it will be otherwise. So, V π ( s ) is the expected valueof a Bernoulli random variable which takes the value if and only if the task is completed, and thusis equivalent to the probability of the task being completed under policy π . Similarly, V π i ( s ) whichis defined as the expected undiscounted future reward returned by R i , is equivalent to the probabilityof the task encoded by R i being completed given the team follows policy π .We note that trajectory s s ... will result in the eventual completion of the task described by R if andonly if there exists some k > such that R ( L ( s , ... s k )) = 1 . By the result of Theorem 2 however, R ( L ( s , ... s k )) = 1 if and only if R i ( L i ( s i ...s ik )) = 1 for all i = 1 , ..., N . So, the trajectory willresult in the eventual completion of the task described by R if and only if it also results in the eventualcompletion of all projected tasks described by R i , i = 1 , ..., N . So, the likelihood of completingthe task encoded by R under policy π is equivalent to the likelihood of completing all the tasksencoded by the collection {R i } Ni =1 under the same policy. Thus, with our interpretation of V π ( s ) and V π i ( s ) as these probabilities, we apply the Fréchet conjunction inequality to arrive at the finalresult. 16 Figure 4: Flowchart of DQPRM policy execution.Figure 4 details the interaction between theagents and the environment for a single timestep during testing. The i th agent uses itsstate s i , its local RM state u i , and its learnedcollection of q-functions Q i to select an ac-tion a i . Action a i contributes to the team’sjoint action a = ( a , ..., a N ) , which inducesan environment transition to a new joint state s (cid:48) = ( s (cid:48) , ..., s (cid:48) N ) . Each agent then interprets itslocal state pair ( s (cid:48) i , u i ) through its local labelingfunction L i to obtain the event e ∈ Σ i occurringat that time step. If e is a shared event, agent i will synchronize with all collaborating agentsbefore updating the state of its local RM R i .The team is successful if it is rewarded by theteam RM R .
13 A Ten-Agent Experiment
Figure 5 shows the experimental results for a ten-agent variant of the rendezvous task described in §5.To successfully complete the task, all agents must simultaneously occupy the rendezvous locationbefore proceeding to their respective goal locations. We observe that while the DQPRM algorithmconverges to a team policy relatively quickly, the baseline methods fail to converge within the allowednumber of training steps. For comparison, we recall that in the two-agent variant of the same task,h-IL outperforms DQPRM. In the three-agent variant of the task, DPQRM outperforms h-IL buth-IL converges within the allowed e training steps. This demonstrates the ability of the proposedDQPRM algorithm to scale relatively well with the number of agents. This capability is owed to thefact that by decomposing the task, DQPRM is able to train each agent entirely separately from itsteammates, while ensuring that the resulting composite behavior accomplishes for team’s task. (a) Ten-agent rendezvous task domain. (b) Ten-agent rendezvous task results. Figure 5: (a) The initial position of agent i is marked A i . The common rendezvous location for allagents is marked R and is highlighted in orange. The goal location for agent i is marked G ii