Combining Propositional Logic Based Decision Diagrams with Decision Making in Urban Systems
Jiajing Ling,* Kushagra Chandak,* Akshat Kumar
School of Information Systems, Singapore Management University
{jjling.2018, kushagrac, akshatkumar}@smu.edu.sg

Abstract
Solving multiagent problems can be an uphill task due to uncertainty in the environment, partial observability, and the scalability of the problem at hand. An urban setting poses additional challenges, since we must maintain safety for all users while minimizing congestion among the agents as well as their travel times. To this end, we tackle the problem of multiagent pathfinding under uncertainty and partial observability, where the agents are tasked to move from their starting points to ending points while also satisfying constraints such as low congestion, and we model it as a multiagent reinforcement learning problem. We compile the domain constraints using propositional logic and integrate them with the RL algorithms to enable fast simulation for RL.
The emergence and continued rise of autonomous and semi-autonomous vehicles in the urban landscape has made its way into a number of areas of transportation and mobility, such as self-driving cars and delivery trucks, railways, unmanned aerial vehicles, and delivery drone fleets. Several key challenges remain in managing such agents: maintaining safety (no collisions among vehicles), avoiding congestion, and minimizing travel time to better serve users and reduce pollution. To model such scenarios, we leverage cooperative sequential multiagent decision making, where agents acting in a partially observable and uncertain environment are required to take coordinated decisions towards a long-term goal (Durfee and Zilberstein 2013). Decentralized partially observable MDPs (Dec-POMDPs) provide a rich framework for multiagent planning (Bernstein et al. 2002; Oliehoek and Amato 2016), and are applicable in domains such as vehicle fleet optimization (Nguyen, Kumar, and Lau 2017), cooperative robotics (Amato et al. 2019), and multiplayer video games (Rashid et al. 2018). However, scalability remains a key challenge, with even a 2-agent Dec-POMDP being NEXP-hard to solve optimally (Bernstein et al. 2002). To address the challenge of scalability, several frameworks have been introduced that model restricted classes of interactions among agents, such as transition independence (Becker et al. 2004; Nair et al. 2005), and event-driven and population-based interactions (Becker, Zilberstein, and Lesser 2004; Varakantham et al. 2012).

Figure 1: Airspace management for drone traffic (Hio 2016)
Recently, several multiagent reinforcement learning (MARL) approaches have been developed that push the scalability envelope (Lowe et al. 2017; Foerster et al. 2018; Rashid et al. 2018) by using simulation-driven optimization of agent policies.

Key limitations of several MARL approaches include sample inefficiency, and difficulty in learning when rewards are sparse, which is often the case in problems with a combinatorial flavor. We address such a combinatorial problem: multiagent path finding (MAPF) under uncertainty and partial observability. Even the deterministic MAPF setting, where multiple agents need to find collision-free paths from their respective sources to destinations in a shared environment, is NP-hard (Yu and LaValle 2013).

The MAPF problem is a general formulation that is capable of addressing several applications in the domain of urban mobility, such as autonomous vehicle fleet optimization (Ling, Gupta, and Kumar 2020; Sartoretti et al. 2019), taxiway path planning for aircraft (Li et al. 2019), and train rescheduling (Nygren and Mohanty 2020). Figure 1 shows the airspace of a city divided into multiple geofenced air-blocks. Such structured airspace can be used by drones to safely travel to their destinations (Ling, Gupta, and Kumar 2020). Since such spaces can have a lot of constraints, they can be modelled using our framework to manage the traffic.

Deep RL approaches have been applied to MAPF under uncertainty and partial observability (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020). A key challenge faced by RL algorithms is that it takes several simulations to find even a single route to the destination, as model-free RL does not explicitly exploit the underlying graph connectivity. Furthermore, agents can move in cycles, especially during initial training episodes, which makes standard RL approaches highly sample inefficient. Recent approaches combine the underlying graph structure with deep neural nets for combinatorial problems such as minimum vertex cover and the traveling salesman problem (Dai et al. 2017; Bello et al. 2019). However, the knowledge compilation framework that we present provides much more explicit domain knowledge to RL approaches for MAPF.

To address the challenges of delayed rewards and the difficulty of finding feasible routes to destinations, we compile the graph over which agents move in MAPF using propositional logic based probabilistic sentential decision diagrams (psdd) (Kisa et al. 2014). A psdd represents probability distributions defined over the models of a given propositional theory. We use a psdd to represent the distribution over all simple paths (without loops) for a given source-destination pair. A key benefit is that any random sample from a psdd is guaranteed to be a valid simple path from the given source to destination. Furthermore, psdds are also equipped with associated inference methods (Shen, Choi, and Darwiche 2016a) (such as computing conditional probabilities) that significantly aid RL methods (e.g., given the current partial path, what are the possible next edges that are guaranteed to lead to the destination via a simple path). Using a psdd significantly helps in pruning the search space, and in generating high-quality training samples for the underlying learning algorithm.
However, integrating psdd with different RL methods is challenging, as the standard psdd inference methods are too slow to be used in the simulation-driven RL setting, where one needs to query the psdd at each time step. Therefore, we also develop highly efficient psdd inference methods that specifically aid RL by enabling fast sampling of training episodes, and are more than an order of magnitude faster than generic psdd inference. Given that the number of paths between a source-destination pair can be exponential, we also use hierarchical decomposition of the graph to enable a tractable psdd representation (Choi, Shen, and Darwiche 2017a).

To summarize, our main contributions are as follows. First, we compile static domain information, such as the underlying graph connectivity, using psdd for the MAPF problem under uncertainty and partial observability.
Second, we develop techniques to integrate such decision diagrams within diverse deep RL algorithms based on policy gradient and Q-learning.
Third, we develop fast algorithms to query the compiled decision diagrams to enable fast simulation for MARL. We integrate our psdd-based framework with previous MARL approaches (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020), and show that the resulting algorithms significantly outperform the original algorithms both in terms of sample complexity and solution quality on a number of instances. We also highlight that psdd is a general framework for incorporating constraints in decision making, and discuss extensions of the standard MAPF that can be addressed using psdd.

A Dec-POMDP is defined using the tuple ⟨S, A, T, O, Z, r, n, γ⟩.
There are n agents in the environment (indexed by i = 1:n). The environment can be in one of the states s ∈ S. At each time step, agent i chooses an action a_i ∈ A, resulting in the joint action a ∈ A ≡ A^n. As a result of the joint action, the environment transitions to a new state s′ with probability T(s, a, s′). The joint reward to the agent team is given as r(s, a). The reward discount factor is γ < 1.

We assume a partially observable setting in which agent i's observation z_i ∈ Z is generated using the observation function O(a, s′, z_i) = P(z_i | a, s′), where the last joint action taken was a and the resulting state was s′ (for simplicity, we have assumed the observation function is the same for all agents). As a result, different agents can receive different observations from the environment.

An agent's policy is a mapping from its action-observation history τ_i ∈ (Z × A)* to actions, π_i(a_i | τ_i; θ_i), where θ_i parameterizes the policy. Let the discounted future return be denoted by R_t = Σ_{k=0}^∞ γ^k r_{k+t}. The joint value function induced by the joint policy of all the agents is denoted as V^π(s_t) = E_{s_{t+1:∞}, a_{t:∞}}[R_t | s_t, a_t], and the joint action-value function as Q^π(s_t, a_t) = E_{s_{t+1:∞}, a_{t+1:∞}}[R_t | s_t, a_t]. The goal is to find the best joint policy π to maximize the value for the starting belief b_0: V(π) = Σ_s b_0(s) V^π(s).

Learning from simulation:
In the RL setting, we do not have access to the transition and observation functions T, O. Instead, multiagent RL (MARL) approaches learn via interacting with the environment simulator. The simulator, given the joint-action input a_t at time t, provides the next environment state s_{t+1}, generates an observation z^i_{t+1} for each agent, and provides the reward signal r_t. Similar to several previous MARL approaches, we assume centralized learning and decentralized policy execution (Foerster et al. 2018; Lowe et al. 2017). During centralized training, we assume access to extra information (such as the environment state and the actions of different agents) that helps in learning the value functions V^π, Q^π. However, during policy execution, agents rely on their local action-observation history. An agent's policy π_i is typically implemented using recurrent neural nets to condition on the action-observation history (Hausknecht and Stone 2015). However, our developed results are not affected by a particular implementation of agent policies.
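As a minimal sketch of the simulator interface assumed above (the class and method names here are illustrative stand-ins, not the paper's implementation):

```python
import random

class MapfSimulator:
    """Illustrative stand-in for the environment simulator: given the joint
    action a_t, it returns s_{t+1}, per-agent observations z^i_{t+1}, and r_t."""

    def __init__(self, n_agents, seed=0):
        self.n = n_agents
        self.state = 0
        self.rng = random.Random(seed)

    def step(self, joint_action):
        # toy dynamics: the state index advances; real transitions follow T(s, a, s')
        self.state += 1
        # each agent observes (at least) its own location; here a dummy tuple
        observations = [(i, self.state) for i in range(self.n)]
        reward = -1.0  # e.g., a per-step penalty until all agents reach their goals
        return self.state, observations, reward

sim = MapfSimulator(n_agents=2)
s_next, z, r = sim.step(joint_action=[0, 1])
```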
MARL for MAPF:
MAPF can be mapped to a Dec-POMDP instance in multiple ways to address different variants (Ma, Kumar, and Koenig 2017; Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020). We therefore present the MAPF problem under uncertainty and partial observability using minimal assumptions to ensure the generality of our knowledge compilation framework. There is a graph G = (V, E) where the set V denotes the locations where agents can move, and edges connect different locations. An agent i has a start vertex s_i and final goal vertex g_i. At any time step, an agent can be located at a vertex v ∈ V, or in-transit on an edge (u, v) (i.e., moving from vertex u to v).

Figure 2: Undesirable path samples for MAPF. (a) Path with a loop; (b) Path to a dead end. Dark nodes are blocked.
An agent's action set is denoted by A = A_mov ∪ A_oa. Intuitively, A_mov denotes actions that intend to change the location of the agent from the current vertex to a neighboring, directly connected vertex in the graph (e.g., move up, right, down, left in a grid graph). The set A_oa denotes other actions that do not intend to change the location of the agent (e.g., noop, which intends to make the agent stay at the current vertex). Note that we do not make any assumptions regarding the actual transition after taking the action (i.e., move/stay actions may succeed or fail as per the specific MAPF instance).

Depending on the states of all the agents, an agent i receives observation z_i. We assume that an agent is able to fully observe its current location (i.e., the vertex it is currently located at). Other information can also be part of the observation (e.g., the locations of agents in the local neighborhood of the agent), but we make no assumptions about such information. We make no specific assumptions about the joint reward r, other than assuming that an agent prefers to reach its destination as fast as possible if the agent's movements do not conflict with other agents' movements. Typical examples of the reward r include a penalty for every time step an agent is not at its goal vertex, a positive reward at the goal vertex, and a high penalty for creating congestion at vertices or edges of the graph (Ling, Gupta, and Kumar 2020), or for blocking other agents from moving to their destination (Sartoretti et al. 2019).

A key challenge for RL algorithms for MAPF is that finding feasible paths to destinations often requires a large number of samples. For example, figure 2(a) shows the case where an agent loops back to one of its earlier vertices. Figure 2(b) shows another scenario where an agent moves towards a dead end. Such scenarios increase the training episode length in RL. Our key intuition is to develop techniques that ensure that RL approaches only sample paths that (i) are simple, and (ii) always originate at the source vertex s_i and end at the goal vertex g_i for any agent i.

Let p^i_t denote the path taken by an agent i until time t (or the sequence of vertices visited by the agent starting from source s_i). We also assume that it does not contain any cycle. This information can be extracted from the agent's history τ^i_t. Let a ∈ A_mov be a movement action towards vertex a_v. We assume the existence of a function feasibleActions(p_t; s_i, d_i) that takes as input an agent's current path p_t and returns the set nextActions = {a ∈ A_mov s.t. [p_t, a_v] ⇝ d_i}. The condition [p_t, a_v] ⇝ d_i implies there exists at least one simple path from source s_i to destination d_i that includes the path segment [p_t, a_v]. Thus, starting with p_0 = [s_i], the RL approach would only sample simple paths that are guaranteed to reach an agent's destination, thereby significantly pruning the search space and resulting in trajectories that have good potential to generate high rewards. The information required for implementing feasibleActions can be compiled offline even before training and execution of the policy starts (explained in the next section, using decision diagrams), and does not involve any communication overhead during policy execution. Using this abstraction, we next present simple and easy-to-implement modifications to a variety of deep multiagent RL algorithms.
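A rollout loop using this abstraction might look as follows; feasible_actions below is a placeholder for the psdd-backed oracle developed in the next section, and next vertices stand in for the corresponding movement actions:

```python
import random

def sample_episode(feasible_actions, policy, source, dest, max_steps=500):
    """Sample a trajectory whose path stays simple and remains extendible
    to `dest`, by restricting each step to feasibleActions."""
    path = [source]
    while path[-1] != dest and len(path) < max_steps:
        options = feasible_actions(path, source, dest)  # never a dead end by construction
        path.append(policy(path, options) if policy else random.choice(options))
    return path
```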
Policy gradient based MARL:
We first provide a brief background on policy gradient approaches for the single-agent case (Sutton et al. 2000). An agent's policy π_θ is parameterized by θ. The policy is optimized using gradient ascent on the total expected reward V(θ) = E_{π_θ}[R]. The gradient is given as:

∇_θ V(θ) = E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^∞ R_t ∇_θ log π_θ(a_t | s_t) ]   (1)

The above gradient expression also extends to the multiagent case in an analogous manner (Peshkin et al. 2000; Foerster et al. 2018). In the multiagent setting, we can compute the gradient of the joint value function V w.r.t. an agent i's policy parameters θ_i, or ∇_{θ_i} V. The expectation is w.r.t. the joint state-action trajectories E_{s_{0:∞}, a_{0:∞}}, and R_t denotes the future return for the agent team. The input to the policy consists of some features of the agent's observation history, φ(τ_i). The function φ can be either hard-coded (e.g., only the last two observations) or learned using recurrent neural networks.

For using the knowledge compiled in the function feasibleActions, the only change we require is in the structure of an agent's policy π (we omit the superscript i for brevity). The main challenge is addressing the variable-sized output of the policy in a differentiable fashion. Assuming a deep neural net based policy π, given the discrete action space A, the last layer of the policy has |A| outputs using the softmax layer (to normalize action probabilities π(a|·)). However, when using feasibleActions, the probability of actions not in feasibleActions needs to be zero, and the set feasibleActions changes as the observation history τ of the agent is updated. Therefore, a fixed-size output layer appears to create difficulties. However, we propose an easy fix. We use π̃ to denote the standard way the policy π is constructed, with the last layer having fixed |A| outputs. However, we do not require the last layer to be a softmax layer. Instead, we re-define the policy π as:

π(a | τ) = 0, if a ∉ feasibleActions(p(τ); s, d);
π(a | τ) = exp(π̃(a | φ(τ))) / Σ_{a′ ∈ feasibleActions(p(τ); s, d)} exp(π̃(a′ | φ(τ))), otherwise.   (2)

where p(τ) denotes the path taken by the agent so far, and s, d are its source and destination. Sampling from π guarantees that invalid actions are not sampled. Furthermore, π is differentiable even when feasibleActions gives different-length outputs at different time steps. The above operation can be easily implemented in autodiff libraries such as Tensorflow without requiring a major change in the policy structure π.
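A minimal sketch of Eq. (2) in TensorFlow (assuming the policy net outputs the unnormalized scores π̃; the function name and tensor shapes are our own, not from the paper):

```python
import tensorflow as tf

def masked_policy(logits, feasible_mask):
    """Eq. (2): zero out infeasible actions by masking their logits with -inf
    before the softmax; gradients flow only through feasible entries, so the
    fixed-size output layer is preserved.
    logits: [batch, |A|] scores from pi-tilde; feasible_mask: [batch, |A|] bool."""
    neg_inf = tf.fill(tf.shape(logits), float("-inf"))
    masked_logits = tf.where(feasible_mask, logits, neg_inf)
    return tf.nn.softmax(masked_logits, axis=-1)

probs = masked_policy(tf.constant([[1.0, 2.0, 0.5]]),
                      tf.constant([[True, False, True]]))  # middle action gets probability 0
```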
Q-learning based MARL:
Deep Q-learning for the single-agent case (Volodymyr et al. 2015) has been extended to the multiagent setting as well (Rashid et al. 2018). In the QMIX approach (Rashid et al. 2018), the joint action-value function Q_tot(τ, a; ψ) is factorized as a (non-linear) combination of the action-value functions Q_i(τ_i, a_i; θ_i) of each agent i. A key operation when training the different parameters θ_i and ψ involves maximizing max_a Q_tot(τ, a; ψ) (for details we refer to Rashid et al.). This operation is intractable in general; however, under certain conditions, it can be approximated in QMIX by maximizing the individual Q functions, max_{a_i ∈ A} Q_i(τ_i, a_i). We require two simple changes to incorporate our knowledge compilation scheme in QMIX. First, instead of maximizing over all the actions, we maximize only over the feasible actions of an agent, as max_{a ∈ feasibleActions(p(τ_i); s_i, d_i)} Q_i(τ_i, a). Second, in Q-learning, typically a replay buffer is also used, which stores samples from the environment as (τ, a, τ′, r). In our case, we additionally store the set of feasible actions for the next observation history τ′_i of each agent i, feasibleActions(p(τ′_i); s_i, d_i), along with the tuple (τ, a, τ′, r). The reason is that when this tuple is replayed, we have to maximize Q_i(τ′_i, a) over a ∈ feasibleActions(p(τ′_i); s_i, d_i), and storing the set feasibleActions(p(τ′_i); s_i, d_i) reduces computation.
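The two QMIX changes can be sketched as follows (the buffer layout is our own; the point is caching feasibleActions(p(τ′_i); s_i, d_i) so the masked max does not re-query the psdd on replay):

```python
import random
from collections import deque

class MaskedReplayBuffer:
    """Stores (tau, a, r, tau', feasible') so the max over feasible actions
    can be computed cheaply when the transition is replayed."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, tau, action, reward, tau_next, feasible_next):
        # feasible_next: action indices from feasibleActions(p(tau'); s, d)
        self.buf.append((tau, action, reward, tau_next, feasible_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def masked_max_q(q_values, feasible_next):
    """Max over a in feasibleActions(p(tau'); s, d) instead of all of A."""
    return max(q_values[a] for a in feasible_next)
```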
We have integrated our knowledge compilation framework with two policy gradient approaches proposed in (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020) (one using a feedforward neural net, the other a recurrent neural network based policy), and a QMIX variant (Fu et al. 2019) for MAPF, demonstrating the generalization power of the framework for a range of MARL solution methods.

We now present our decision diagram based approach to implement the feasibleActions function. Let upper case letters (X) denote variables and lower case letters (x) denote their instantiations. Bold upper case letters (X) denote sets of variables, and their lower case counterparts (x) denote instantiations.

Paths as a Boolean formula:
A path p from a given source s to the destination d in the underlying undirected graph G = (V, E) can be represented as a Boolean formula as follows. Consider Boolean random variables X_{i,j} for each edge (i, j) ∈ E. If an edge (i, j) occurs in p, then X_{i,j} is set to true; otherwise, it is set to false. Hence, the conjunction of these literals denotes path p, and the Boolean formula representing all paths is obtained by simply disjoining the formulas for all such paths (Choi, Tavabi, and Darwiche 2016). An example path in a graph is given in fig 3(a).
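For intuition, a brute-force version of this encoding on a small graph (the compiled sdd introduced next replaces this exhaustive enumeration):

```python
def simple_path_models(adj, s, d):
    """Enumerate simple s-d paths; each path is a full assignment to the
    edge variables X_{i,j} (True iff the edge is on the path). The Boolean
    formula for all paths is the disjunction of these assignments."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    models = []

    def dfs(u, visited, used):
        if u == d:
            models.append({e: (e in used) for e in edges})
            return
        for v in adj[u]:
            if v not in visited:
                dfs(v, visited | {v}, used | {tuple(sorted((u, v)))})

    dfs(s, {s}, frozenset())
    return models

# 4-node toy graph: two simple paths from 1 to 4
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(len(simple_path_models(adj, 1, 4)))  # 2
```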
Sentential decision diagrams:
Since the number of paths between two nodes can be exponential, we need a compact representation of the Boolean formula representing paths. To this end, we use the sentential decision diagram, or sdd (Darwiche 2011). It is a Boolean function f(X, Y) on some non-overlapping variable sets X, Y and is written as a decomposition in terms of functions on X and Y. In particular, f = (p_1(X) ∧ s_1(Y)) ∨ … ∨ (p_n(X) ∧ s_n(Y)), with each element (p_i, s_i) of the decomposition composed of a prime p_i and a sub s_i. An sdd, represented as a decision diagram, describes members of a combinatorial space (e.g., paths in a graph) using propositional logic in a tractable manner. It has two kinds of nodes:
- a terminal node, which can be a literal (X or ¬X), always true (⊤), or always false (⊥), and
- a decision node, which is represented as (p_1 ∧ s_1) ∨ … ∨ (p_n ∧ s_n), where all (p_i, s_i) pairs are recursively sdds, and the primes are always consistent, mutually exclusive, and exhaustive.

Figure 3(b) represents an sdd for the graph in fig 3(a) encoding all paths from n1 to n5. The encircled node is a decision node with two elements (D, E) and (¬D, ⊥). The primes are D and ¬D, and the subs are E and ⊥. The Boolean formula representing this sdd node is (D ∧ E) ∨ (¬D ∧ ⊥), which is equivalent to D ∧ E. The Boolean formula encoded by the whole sdd is given by the root node of the sdd.

An sdd is characterized by a full binary tree, called a vtree, which induces a total order on the variables from a left-right traversal of the vtree. E.g., for the vtree in figure 3(c), the variable order is (A, B, C, D, E). Given a fixed vtree, the sdd is unique. An sdd node n is normalized for (or associated with) a vtree node v as follows:
- If n is a terminal node, then v is a leaf vtree node which contains the variable of n (if any).
- If n is a decision node, then n's primes (subs) are normalized for the left (right) child of v.
- If n is the root node, then v is the root vtree node.

Intuitively, a decision node n being normalized for vtree node v implies that the Boolean formula encoded by n contains only those variables contained in the sub-tree rooted at v. We will use this normalization property for our analysis later. The Boolean formula encoding the domain knowledge can be compiled into a decision diagram using the sdd compiler (Oztok and Darwiche 2015). The resulting sdd need not be exponential in size even though it represents an exponential number of objects.

Probabilistic Sentential decision diagrams:
In our case, for computing feasibleActions, we also need to associate a probability distribution with the sdd that encodes all the paths from a given source to destination. The key benefit is that we can exploit associated inference methods, such as computing conditional probabilities, which help in computing feasibleActions.

If we parameterize each of the decision nodes of the sdd such that the local parameters form a distribution, the resulting probabilistic structure is called a psdd, or probabilistic sdd (Kisa et al. 2014). It can be used to represent discrete probability distributions Pr(X) where several instantiations x have zero probability Pr(x) = 0 because of the constraints imposed on the space. More concretely, a psdd normalized for an sdd is defined as follows:
- For each decision node (p_1, s_1), …, (p_n, s_n), there is a positive parameter θ_i such that Σ_{i=1}^n θ_i = 1 and θ_i = 0 iff s_i = ⊥.
- For each terminal node ⊤, there is a parameter 0 < θ < 1.

psdds are tractable models of probability distributions, as several probabilistic queries can be performed in poly-time, such as computing marginal probabilities or conditional probabilities.

Figure 3: (a) A simple path in a graph from s = n1 to d = n5 is highlighted in red and can be written as the propositional sentence A ∧ C ∧ E ∧ ¬B ∧ ¬D; (b) an sdd for the graph in (a), where the encircled node represents a decision node (p_1, s_1), (p_2, s_2); (c) a right-linear vtree for the sdd; (d) a psdd with parameters annotated on decision nodes.

NZ (Non-Zero) Inference for feasibleActions: Given an sdd encoding all simple paths from a source s to a destination d, we uniformly parameterize this sdd as noted earlier. That is, for a decision node (p_1, s_1), …, (p_n, s_n), each θ_i is the same (except when s_i = ⊥, in which case θ_i = 0), and we also enforce that the non-zero θ_i's normalize to 1. This strategy makes sure that the probability of each simple path from s to d is non-zero. Assume that the current sampled path by the agent is p (in the context of the psdd, we assume that p is the set of edges in graph G traversed from source s by the agent). Let v_p denote the current vertex of the agent (and assume v_p is not the destination). Let Nb(v_p) denote all direct neighbors of v_p. The feasibleActions set is given as:

feasibleActions(p) = { v′ ∈ Nb(v_p) : (v_p, v′) ∉ p ∧ Pr((v_p, v′) | p) > 0 }   (3)
That is, if the conditional probability Pr((v_p, v′) | p) = 0, then v′ can be pruned from the action set, as it implies there is no simple path to destination d that takes the edge (v_p, v′) after taking the path p. This strategy seems straightforward to implement, as a psdd is equipped with inference methods to compute conditional probabilities. However, in RL, this inference needs to be done at each time step of each training episode. We observed empirically that this method was extremely slow, and it was impractical to scale it to multiple agents. We therefore next develop our customized inference technique, which is much faster than this generic inference.
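For concreteness, the straightforward implementation just described looks as follows; psdd.conditional_prob is a hypothetical generic-inference call, and its per-step cost is what motivates the customized method developed next:

```python
def feasible_actions(psdd, edge_var, neighbors, path_edges, v_p):
    """Eq. (3): keep neighbor v' iff the edge (v_p, v') is unused and
    Pr(X_{(v_p, v')} = 1 | path so far) > 0 under the psdd."""
    evidence = {edge_var[e]: True for e in path_edges}
    feasible = []
    for v_next in neighbors[v_p]:
        e = tuple(sorted((v_p, v_next)))
        if e in path_edges:
            continue  # a simple path never repeats an edge
        # hypothetical generic-inference query; too slow to run at every
        # RL time step in practice
        if psdd.conditional_prob(edge_var[e], evidence) > 0.0:
            feasible.append(v_next)
    return feasible
```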
Sub-context connectivity analysis for NZ Inference:
We note that all the discussion below is for a psdd that encodes all simple paths from a source s to destination d, where the psdd is normalized for some right-linear vtree. Proofs for the different results are provided in the supplementary material in the full paper available on arXiv.
Lemma 1. In a psdd normalized for a right-linear vtree, each prime is a literal (X or ¬X) or ⊤.

The above result is a direct consequence of the manner in which the underlying sdd is constructed using a right-linear vtree.

We sample a path from such a psdd by traversing it in a top-down fashion, selecting one branch at a time for each of the decision nodes according to the probability of that branch, and then selecting the prime and recursively going down the sub (Kisa et al. 2014). As all the prime nodes are terminal as per Lemma 1, if the prime node is a positive literal X, then we select the edge corresponding to X for our path (say e_X). If the prime node is ¬X, then we do not select edge e_X. We show in the supplement that the prime nodes encountered during such a sampling procedure for a psdd that encodes simple paths cannot be ⊤.

As an example, consider the graph in fig 3(a) and its corresponding psdd in fig 3(d). We start at the root of the psdd and select the left branch with probability 1. We then select the prime A in our sample and recursively go down its sub, as shown by the red arrows. The final sampled path is A−C−E, and the corresponding Boolean formula is A ∧ ¬B ∧ C ∧ ¬D ∧ E.

Definition 1. (S-Path) Let n be a psdd node normalized either for ṽ_l or ṽ_r, the two deepest vtree nodes. Let (p_1, s_1), …, (p_k, s_k) be the elements appearing on some path from the psdd root to node n (i.e., n = p_k or n = s_k). Then p_1 ∧ … ∧ p_k ∧ n is called an s-path for node n, and is feasible iff s_i ≠ ⊥ for all i.

In figure 3(c), ṽ_l is D and ṽ_r is E. There can be multiple s-paths for a node n. Let spset denote the set of all feasible s-paths for all psdd nodes n normalized either for ṽ_l or ṽ_r.
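A sketch of this top-down sampling over a right-linear-vtree psdd (the Node layout here is our own simplification; literals are signed integers, with positive ids naming edges):

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    # elements: (prime literal, sub node or None for a terminal, parameter theta)
    elements: List[Tuple[int, Optional["Node"], float]]

def sample_simple_path(root: Node, rng=random) -> set:
    """Pick one element per decision node by its theta, keep the edge of
    every positive prime literal, and recurse down the chosen sub."""
    edges, node = set(), root
    while node is not None:
        lits, subs, thetas = zip(*node.elements)
        i = rng.choices(range(len(lits)), weights=thetas)[0]
        if lits[i] > 0:           # prime X selected: include edge e_X
            edges.add(lits[i])
        node = subs[i]
    return edges

# toy psdd with two paths: {1, 3} w.p. 0.7 and {2} w.p. 0.3 (hypothetical)
leaf = Node([(3, None, 1.0)])
root = Node([(1, leaf, 0.7), (2, None, 0.3)])
print(sample_simple_path(root))
```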
Lemma 2. There is a one-to-one mapping between the s-paths in the set spset and the set of all simple paths in G from source s to destination d.

The above lemma states that if we find a feasible s-path in the psdd, then it corresponds to a valid simple path from source s to destination d in the graph G, which will also have nonzero probability as per our psdd. Reading off the path in G given a feasible s-path is straightforward. A feasible s-path is also a conjunction of literals (using Lemma 1, and if n is a sub, it will also be a literal, as n is normalized for the deepest node in the vtree). For each positive literal X in the s-path, we include its corresponding edge e_X in the path in G. The set of resulting edges forms a simple path in G.

This result also provides a strategy for our fast NZ inference. Given a path p in graph G, our goal is to find whether Pr((v_p, v′) | p) > 0. If we can prove that there exists an s-path sp ∈ spset such that its corresponding path in graph G (using Lemma 2) contains all the edges in p and (v_p, v′), then Pr((v_p, v′) | p) must be nonzero. We need a few additional results below to turn this insight into an efficient algorithmic procedure.

Definition 2. (Sub-context (Kisa et al. 2014)) Let (p_1, s_1), …, (p_k, s_k) be the elements appearing on some path from the sdd root to node n (i.e., n = p_k or n = s_k). Then p_1 ∧ … ∧ p_k is called a sub-context sc for node n, and is feasible iff s_i ≠ ⊥ for all i.

Notice that a psdd node n can have multiple (feasible) sub-contexts, as a psdd is a directed acyclic graph (DAG). Essentially, each sub-context corresponds to one possible way of reaching node n from the psdd root. For a right-linear vtree, a feasible sub-context is a conjunction of literals, as all primes are literals (Lemma 1).

Given two psdd nodes n and n′, we say that n′ is deeper than n if the vtree node v′ for which n′ is normalized is deeper than the vtree node v for which n is normalized.

Definition 3. (Sub-context set) Let X be a positive literal, and let p_{i_1}, …, p_{i_k} be psdd prime nodes such that each p_{i_j} = X. Let ssc_1, …, ssc_k be sets such that each ssc_j contains all the feasible sub-contexts of p_{i_j}. Then the sub-context set of X, denoted by sset(X), is defined as sset(X) = ∪_{j=1}^k ssc_j.

We now show the procedure to perform sub-context connectivity analysis for NZ inference. Assume that the current sampled path in graph G is p = {e_1, …, e_k} (each e_i is an edge traversed by the agent so far). Let the current vertex of the agent be v_p. Let e = (v_p, v′) be one possible edge in G that the agent can traverse next. Let X_{e_1}, …, X_{e_k}, X_e be the respective Boolean variables for the different edges. We wish to determine whether P(X_e | X_{e_1}, …, X_{e_k}) is greater than zero (shorthand for P(X_e = 1 | X_{e_1} = 1, …, X_{e_k} = 1)). We follow the following steps to determine this.

1. Find the variable X̃ ∈ {X_{e_1}, …, X_{e_k}, X_e} that is deepest in the vtree order.
2. Check if there exists a sub-context sc ∈ sset(X̃) such that sc contains all the positive literals {X_{e_1}, …, X_{e_k}, X_e}. Concretely, check if ∃ sc ∈ sset(X̃) s.t. sc ∧ X = sc, ∀X ∈ {X_{e_1}, …, X_{e_k}, X_e}. Denote this sub-context sc* (if it exists).

3. Since sc* is the sub-context of the variable deepest in the vtree order among {X_{e_1}, …, X_{e_k}, X_e}, it can be extended to a feasible s-path sp ∈ spset such that sp contains sc* (or sc* ∧ sp = sp); this is proved formally in the supplementary. Therefore, we have shown the existence of a feasible s-path sp that contains all the literals {X_{e_1}, …, X_{e_k}, X_e}, and by Lemma 2, there also exists a simple path in graph G that contains the edges {e_1, …, e_k, e}. Therefore, P(X_e | X_{e_1}, …, X_{e_k}) is non-zero.

4. If sc* does not exist, then a feasible s-path cannot be found containing all the literals {X_{e_1}, …, X_{e_k}, X_e} (proved in the supplementary). Therefore, P(X_e | X_{e_1}, …, X_{e_k}) is zero.

Step 2 in the method above is computationally the most challenging. We develop additional results in the supplementary material that further optimize this step, resulting in a fast and practical algorithm for NZ inference; a sketch of the procedure appears below.
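Putting the four steps together, assuming sset is precomputed offline and each sub-context is stored as a set of positive literals (variables ordered by vtree depth are an assumption of this sketch):

```python
def nz_inference(sset, depth, path_vars, candidate_var):
    """Return True iff P(X_e | X_{e_1}, ..., X_{e_k}) > 0.
    sset: variable -> iterable of feasible sub-contexts (sets of literals);
    depth: variable -> its depth in the vtree order."""
    literals = set(path_vars) | {candidate_var}
    deepest = max(literals, key=lambda x: depth[x])   # step 1
    for sc in sset.get(deepest, ()):                  # step 2
        if literals <= sc:
            return True                               # step 3: sc* extends to a feasible s-path
    return False                                      # step 4: no sc* => probability is zero
```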
Hierarchical clustering for large graphs:
For increasing the scalability of the psdd framework and NZ inference on large graphs, we take motivation from (Choi, Shen, and Darwiche 2017b; Shen et al. 2019). These previous results show that by suitably partitioning the graph G among clusters, we can keep the size of the psdd tractable even for very large graphs. Such partitioning does result in a loss of expressiveness, as the psdd for the partitioned graph may omit some simple paths; empirically, however, we found that this partitioning scheme still improved the efficiency of the underlying RL algorithms significantly. This partitioning method is described in the supplementary material in more detail.

The framework that we presented can be used to compile a number of different kinds of constraints. For example, the agent may have to first go to a pickup location and then to a delivery location (Liu et al. 2019), or satisfy TSP-like constraints where the agent has to visit some locations before reaching the destination while avoiding collisions. An example is explained below in more detail.
Landmark Constraints:
This framework can be extended to settings where an agent is required to visit some landmarks before reaching the destination. We can construct the Boolean formula representing such a constraint by taking the incident edge variables for each of the landmarks and requiring at least one of them to be true. We can then multiply (Shen, Choi, and Darwiche 2016b) the psdd representing such a formula with the psdd representing the simple-path constraint. For example, if n_i is a node representing a landmark and A, B, C, D are Boolean variables representing the edges incident on n_i, then we can represent the constraint for n_i as β_i = A ∨ B ∨ C ∨ D. For k such landmarks, we can similarly represent the constraints β_1, …, β_k. Then the Boolean formula for all the landmarks would be β = ∧_{i=1}^k β_i, and it can be compiled as a psdd. Now, if α is a psdd representing simple paths between a source and a destination, then we can multiply α and β to get the final psdd representing simple paths where the agent is required to visit the landmarks before the destination. This strategy can be scaled up by hierarchical partitioning of the graph (Choi, Shen, and Darwiche 2017b; Shen et al. 2019) and can be used to represent complex constraints by multiplying them. This process is also modular, since the constraints are modeled separately from the underlying graph connectivity.
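The Boolean side of this construction is easy to sketch; compiling β and multiplying it with α is done via the psdd multiply operator of Shen, Choi, and Darwiche (2016b) and is not reproduced here (the clause representation below is our own):

```python
def landmark_formula(landmarks, incident_edges):
    """beta = AND over landmarks n_i of (OR of the edge variables incident on
    n_i), represented as a list of clauses (sets of positive edge literals)."""
    return [set(incident_edges[n]) for n in landmarks]

def visits_all_landmarks(path_edges, beta):
    """A path (set of true edge variables) satisfies beta iff it touches
    at least one incident edge of every landmark."""
    return all(clause & path_edges for clause in beta)

# hypothetical landmark n_i with incident edges A, B, C, D
beta = landmark_formula(["n_i"], {"n_i": {"A", "B", "C", "D"}})
print(visits_all_landmarks({"A", "E"}, beta))  # True: edge A covers n_i
```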
Furthermore, this framework can also be used in cases where the underlying graph connectivity is dynamic, e.g., in scenarios where edges get blocked over time or the graph is revealed with time, as in the Canadian Traveller Problem (Liao and Huang 2014). Any observation about blocked edges at a time step can become evidence, and by conditioning on this evidence, the agent can rule out routes via such blocked edges. The generalizability and flexibility of this framework make it a promising approach for combining domain knowledge with models for RL, pathfinding, and other areas.

We present results to show how the integration of our framework with previous multiagent deep-RL approaches based on policy gradient and Q-learning (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020) performs better on MAPF problems in terms of both sample efficiency and solution quality on a number of different maps with varying numbers of agents.

Simulation Speed:
We show comparisons between our method and the generic psdd inference method for calculating marginal probabilities. Our approach is more than an order of magnitude faster.
Approach         No Clustering   Clustering
Ours             26.55           158.41
psdd inference   979.71          402665.98

Table 1: Simulation speed comparison (in seconds)
Open Grid Maps:
We next evaluate the integration of our knowledge-based framework with policy gradient and Q-learning based approaches. We combine our framework with DCRL (Ling, Gupta, and Kumar 2020) and MAPQN (Fu et al. 2019) on several open grid maps with a varying number of agents. DCRL is a policy gradient based algorithm, and MAPQN is a Q-learning based algorithm. We follow the same MAPF model as (Ling, Gupta, and Kumar 2020), where each node has its own capacity (the maximum number of agents that can be accommodated), and agents can take multiple time steps to move between two contiguous nodes. The total objective is to minimize the sum of costs (SOC) of all agents combined with penalties for congestion. More details on the experiments, the neural network structure, and the hyperparameters are noted in the supplementary material.

The environment settings vary from a 4x4 grid with 2 agents up to a 10x10 grid with 30 agents. We generated 10 instances for each setting. In each setting, we follow (Ling, Gupta, and Kumar 2020) to randomly select the sources and destinations and to specify the capacity of each node. We also specify the min and max time (t_min, t_max) to move between two nodes. We run each instance three times, and we terminate the runs either after 500 iterations or 10 hours. Each episode has a maximum length of 500 steps. For each instance, we choose the run with the best performance. We compute the total objective averaged over all agents and the cumulative number of samples averaged over all agents during training. Finally, we plot the average total objective vs. the average cumulative sample count over all instances.

Figure 4 shows the results comparing DCRL with Knowledge Compilation (DCRL+KCO) and DCRL on 4x4, 8x8, and 10x10 grids (plots for 4x4 with 2 agents, 8x8 with 6 agents, and 10x10 with 10 agents are deferred to the supplementary). Although all agents are able to reach their respective destinations (no stranded agents) in both DCRL+KCO and DCRL, agents are trained to reach destinations cooperatively with significantly fewer samples in DCRL+KCO. This means that agents explore the environment more efficiently in DCRL+KCO than in DCRL, especially during the initial few training episodes. This is also reflected in the plot, as the average total objective in DCRL+KCO is significantly higher during the initial training phase compared to DCRL.

Figure 5 shows the comparison of sample efficiency between MAPQN+KCO and MAPQN on 4x4 and 8x8 grids. We did not evaluate MAPQN+KCO on the 10x10 grid, since MAPQN itself is not able to train a large number of agents on large grid maps (more details in (Ling, Gupta, and Kumar 2020)). We observe that MAPQN+KCO converges faster and to a better quality than MAPQN, especially on the 8x8 grid. This is because several agents did not reach their destinations within the episode cutoff in MAPQN; in contrast, all agents reach their destinations in MAPQN+KCO.

Obstacles:
We evaluate KCO with DCRL and MAPQN on a 10x10 obstacle map with a varying number of agents (from 2 agents up to 10 agents). The obstacles are randomly generated with density 0.35. We generate 10 instances for this setting. For each instance, sources and destinations are randomly generated from the non-blocked nodes in the top and bottom rows (each source and destination pair is guaranteed to be reachable). Other parameters are specified in the same way as in the above experiments. This set of experiments is quite challenging, especially when there are several agents, since they can easily go into dead ends while cooperating with each other to reduce the congestion level. Figure 6 clearly shows that DCRL and MAPQN converge much faster with the integration of KCO, and confirms that our approach is more sample efficient. Specifically for MAPQN, several agents did not reach their destinations (8.8 agents on average for the N10 case), whereas in MAPQN+KCO all agents reached their destinations, which explains the much better solution quality of MAPQN+KCO.

We also evaluate KCO integrated with the PRIMAL framework (Sartoretti et al. 2019), which is based on asynchronous advantage actor-critic, or A3C (Mnih et al. 2016), combined with imitation learning. We test it on a 10x10 map