Combining Propositional Logic Based Decision Diagrams with Decision Making in Urban Systems
Jiajing Ling,* Kushagra Chandak,* Akshat Kumar
School of Information Systems, Singapore Management University
{jjling.2018, kushagrac, akshatkumar}@smu.edu.sg

Abstract
Solving multiagent problems can be an uphill task due to uncertainty in the environment, partial observability, and the scalability of the problem at hand. An urban setting poses additional challenges, since we must maintain safety for all users while minimizing congestion among the agents as well as their travel times. To this end, we tackle the problem of multiagent pathfinding under uncertainty and partial observability, where the agents are tasked to move from their starting points to ending points while also satisfying constraints such as low congestion, and we model it as a multiagent reinforcement learning problem. We compile the domain constraints using propositional logic and integrate them with the RL algorithms to enable fast simulation for RL.
The emergence and continued rise of autonomous and semi-autonomous vehicles in the urban landscape has made its way into a number of areas of transportation and mobility, such as self-driving cars and delivery trucks, railways, unmanned aerial vehicles, and delivery drone fleets. Several key challenges remain in managing such agents: maintaining safety (no collisions among vehicles), avoiding congestion, and minimizing travel time to better serve users and reduce pollution. To model such scenarios, we leverage cooperative sequential multiagent decision making, where agents acting in a partially observable and uncertain environment are required to take coordinated decisions towards a long-term goal (Durfee and Zilberstein 2013). Decentralized partially observable MDPs (Dec-POMDPs) provide a rich framework for multiagent planning (Bernstein et al. 2002; Oliehoek and Amato 2016), and are applicable in domains such as vehicle fleet optimization (Nguyen, Kumar, and Lau 2017), cooperative robotics (Amato et al. 2019), and multiplayer video games (Rashid et al. 2018). However, scalability remains a key challenge, with even a 2-agent Dec-POMDP being NEXP-hard to solve optimally (Bernstein et al. 2002). To address the challenge of scalability, several frameworks have been introduced that model restricted classes of interactions among agents, such as transition independence (Becker et al. 2004; Nair et al. 2005), and event-driven and population-based interactions (Becker, Zilberstein, and Lesser 2004; Varakantham et al. 2012).

Figure 1: Airspace management for drone traffic (Hio 2016)
Recently, several multiagent reinforcement learning (MARL) approaches have been developed that push the scalability envelope (Lowe et al. 2017; Foerster et al. 2018; Rashid et al. 2018) by using simulation-driven optimization of agent policies.

Key limitations of several MARL approaches include sample inefficiency, and difficulty in learning when rewards are sparse, which is often the case in problems with a combinatorial flavor. We address such a combinatorial problem: multiagent path finding (MAPF) under uncertainty and partial observability. Even the deterministic MAPF setting, where multiple agents need to find collision-free paths from their respective sources to destinations in a shared environment, is NP-hard (Yu and LaValle 2013).

The MAPF problem is a general formulation that is capable of addressing several applications in the domain of urban mobility, such as autonomous vehicle fleet optimization (Ling, Gupta, and Kumar 2020; Sartoretti et al. 2019), taxiway path planning for aircraft (Li et al. 2019), and train rescheduling (Nygren and Mohanty 2020). Figure 1 shows the airspace of a city divided into multiple geofenced air-blocks. Such structured airspace can be used by drones to safely travel to their destinations (Ling, Gupta, and Kumar 2020). Since such spaces can have a lot of constraints, they can be modelled using our framework to manage the traffic.

Deep RL approaches have been applied to MAPF under uncertainty and partial observability (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020). A key challenge faced by RL algorithms is that it takes several simulations to find even a single route to the destination, as model-free RL does not explicitly exploit the underlying graph connectivity. Furthermore, agents can move in cycles, especially during initial training episodes, which makes standard RL approaches highly sample inefficient. Recent approaches combine the underlying graph structure with deep neural nets for combinatorial problems such as minimum vertex cover and the traveling salesman problem (Dai et al. 2017; Bello et al. 2019). However, the knowledge compilation framework that we present provides much more explicit domain knowledge to RL approaches for MAPF.

To address the challenges of delayed rewards and the difficulty of finding feasible routes to destinations, we compile the graph over which agents move in MAPF using propositional logic based probabilistic sentential decision diagrams (psdd) (Kisa et al. 2014). A psdd represents probability distributions defined over the models of a given propositional theory. We use a psdd to represent the distribution over all simple paths (without loops) for a given source-destination pair. A key benefit is that any random sample from a psdd is guaranteed to be a valid simple path from the given source to destination. Furthermore, psdds are also equipped with associated inference methods (Shen, Choi, and Darwiche 2016a) (such as computing conditional probabilities) that significantly aid RL methods (e.g., given the current partial path, what are the possible next edges that are guaranteed to lead to the destination via a simple path). Using a psdd significantly helps in pruning the search space, and in generating high-quality training samples for the underlying learning algorithm.
However, integrating psdd with different RL methods is challenging, as the standard psdd inference methods are too slow to be used in the simulation-driven RL setting, where one needs to query the psdd at each time step. Therefore, we also develop highly efficient psdd inference methods that specifically aid RL by enabling fast sampling of training episodes, and are more than an order of magnitude faster than generic psdd inference. Given that the number of paths between a source-destination pair can be exponential, we also use hierarchical decomposition of the graph to enable a tractable psdd representation (Choi, Shen, and Darwiche 2017a).

To summarize, our main contributions are as follows. First, we compile static domain information, such as the underlying graph connectivity, using psdd for the MAPF problem under uncertainty and partial observability.
Second, we develop techniques to integrate such decision diagrams within diverse deep RL algorithms based on policy gradient and Q-learning.
Third, we develop fast algorithms to query the compiled decision diagrams to enable fast simulation for MARL. We integrate our psdd-based framework with previous MARL approaches (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020), and show that the resulting algorithms significantly outperform the original algorithms both in terms of sample complexity and solution quality on a number of instances. We also highlight that psdd is a general framework for incorporating constraints in decision making, and discuss extensions of the standard MAPF that can be addressed using psdd.

A Dec-POMDP is defined using the tuple ⟨S, A, T, O, Z, r, n, γ⟩.
There are n agents in the environment (indexed by i = 1:n). The environment can be in one of the states s ∈ S. At each time step, agent i chooses an action a_i ∈ A, resulting in the joint action a ∈ A ≡ A^n. As a result of the joint action, the environment transitions to a new state s′ with probability T(s, a, s′). The joint reward to the agent team is given as r(s, a). The reward discount factor is γ < 1.

We assume a partially observable setting in which agent i's observation z_i ∈ Z is generated using the observation function O(a, s′, z_i) = P(z_i | a, s′), where the last joint action taken was a and the resulting state was s′ (for simplicity, we have assumed the observation function is the same for all agents). As a result, different agents can receive different observations from the environment.

An agent's policy is a mapping from its action-observation history τ_i ∈ (Z × A)* to actions, π_i(a_i | τ_i; θ_i), where θ_i parameterizes the policy. Let the discounted future return be denoted by R_t = Σ_{k=0}^∞ γ^k r_{k+t}. The joint value function induced by the joint policy of all the agents is denoted as V^π(s_t) = E_{s_{t+1:∞}, a_{t:∞}}[R_t | s_t, a_t], and the joint action-value function as Q^π(s_t, a_t) = E_{s_{t+1:∞}, a_{t+1:∞}}[R_t | s_t, a_t]. The goal is to find the best joint policy π to maximize the value for the starting belief b_0: V(π) = Σ_s b_0(s) V^π(s).

Learning from simulation:
In the RL setting, we do not have access to the transition and observation functions T, O. Instead, multiagent RL (MARL) approaches learn via interacting with the environment simulator. The simulator, given the joint-action input a_t at time t, provides the next environment state s_{t+1}, generates an observation z^i_{t+1} for each agent, and provides the reward signal r_t. Similar to several previous MARL approaches, we assume centralized learning and decentralized policy execution (Foerster et al. 2018; Lowe et al. 2017). During centralized training, we assume access to extra information (such as the environment state and the actions of different agents) that helps in learning the value functions V^π, Q^π. However, during policy execution, agents rely on their local action-observation history. An agent's policy π_i is typically implemented using recurrent neural nets to condition on the action-observation history (Hausknecht and Stone 2015). However, our developed results are not affected by a particular implementation of agent policies.
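As a minimal sketch of the simulator interface assumed above (the class and method names here are illustrative stand-ins, not the paper's implementation):

```python
import random

class MapfSimulator:
    """Illustrative stand-in for the environment simulator: given the joint
    action a_t, it returns s_{t+1}, per-agent observations z^i_{t+1}, and r_t."""

    def __init__(self, n_agents, seed=0):
        self.n = n_agents
        self.state = 0
        self.rng = random.Random(seed)

    def step(self, joint_action):
        # toy dynamics: the state index advances; real transitions follow T(s, a, s')
        self.state += 1
        # each agent observes (at least) its own location; here a dummy tuple
        observations = [(i, self.state) for i in range(self.n)]
        reward = -1.0  # e.g., a per-step penalty until all agents reach their goals
        return self.state, observations, reward

sim = MapfSimulator(n_agents=2)
s_next, z, r = sim.step(joint_action=[0, 1])
```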
MARL for MAPF:
MAPF can be mapped to a Dec-POMDP instance in multiple ways to address different variants (Ma, Kumar, and Koenig 2017; Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020). We therefore present the MAPF problem under uncertainty and partial observability using minimal assumptions to ensure the generality of our knowledge compilation framework. There is a graph G = (V, E) where the set V denotes the locations where agents can move, and edges connect different locations. An agent i has a start vertex s_i and final goal vertex g_i. At any time step, an agent can be located at a vertex v ∈ V, or in-transit on an edge (u, v) (i.e., moving from vertex u to v).

Figure 2: Undesirable path samples for MAPF. (a) Path with a loop; (b) Path to a dead end. Dark nodes are blocked.
An agent's action set is denoted by A = A_mov ∪ A_oa. Intuitively, A_mov denotes actions that intend to change the location of the agent from the current vertex to a neighboring, directly connected vertex in the graph (e.g., move up, right, down, left in a grid graph). The set A_oa denotes other actions that do not intend to change the location of the agent (e.g., noop, which intends to make the agent stay at the current vertex). Note that we do not make any assumptions regarding the actual transition after taking the action (i.e., move/stay actions may succeed or fail as per the specific MAPF instance).

Depending on the states of all the agents, an agent i receives observation z_i. We assume that an agent is able to fully observe its current location (i.e., the vertex it is currently located at). Other information can also be part of the observation (e.g., the locations of agents in the local neighborhood of the agent), but we make no assumptions about such information. We make no specific assumptions about the joint reward r, other than assuming that an agent prefers to reach its destination as fast as possible if the agent's movements do not conflict with other agents' movements. Typical examples of the reward r include a penalty for every time step an agent is not at its goal vertex, a positive reward at the goal vertex, and a high penalty for creating congestion at vertices or edges of the graph (Ling, Gupta, and Kumar 2020), or for blocking other agents from moving to their destination (Sartoretti et al. 2019).

A key challenge for RL algorithms for MAPF is that finding feasible paths to destinations often requires a large number of samples. For example, figure 2(a) shows the case where an agent loops back to one of its earlier vertices. Figure 2(b) shows another scenario where an agent moves towards a dead end. Such scenarios increase the training episode length in RL. Our key intuition is to develop techniques that ensure that RL approaches only sample paths that (i) are simple, and (ii) always originate at the source vertex s_i and end at the goal vertex g_i for any agent i.

Let p^i_t denote the path taken by an agent i until time t (or the sequence of vertices visited by the agent starting from source s_i). We also assume that it does not contain any cycle. This information can be extracted from the agent's history τ^i_t. Let a ∈ A_mov be a movement action towards vertex a_v. We assume the existence of a function feasibleActions(p_t; s_i, d_i) that takes as input an agent's current path p_t and returns the set nextActions = {a ∈ A_mov s.t. [p_t, a_v] ⇝ d_i}. The condition [p_t, a_v] ⇝ d_i implies there exists at least one simple path from source s_i to destination d_i that includes the path segment [p_t, a_v]. Thus, starting with p_0 = [s_i], the RL approach would only sample simple paths that are guaranteed to reach an agent's destination, thereby significantly pruning the search space and resulting in trajectories that have good potential to generate high rewards. The information required for implementing feasibleActions can be compiled offline even before training and execution of the policy starts (explained in the next section, using decision diagrams), and does not involve any communication overhead during policy execution. Using this abstraction, we next present simple and easy-to-implement modifications to a variety of deep multiagent RL algorithms.
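A rollout loop using this abstraction might look as follows; feasible_actions below is a placeholder for the psdd-backed oracle developed in the next section, and next vertices stand in for the corresponding movement actions:

```python
import random

def sample_episode(feasible_actions, policy, source, dest, max_steps=500):
    """Sample a trajectory whose path stays simple and remains extendible
    to `dest`, by restricting each step to feasibleActions."""
    path = [source]
    while path[-1] != dest and len(path) < max_steps:
        options = feasible_actions(path, source, dest)  # never a dead end by construction
        path.append(policy(path, options) if policy else random.choice(options))
    return path
```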
Policy gradient based MARL:
We first provide a brief background on policy gradient approaches for the single-agent case (Sutton et al. 2000). An agent's policy π_θ is parameterized by θ. The policy is optimized using gradient ascent on the total expected reward V(θ) = E_{π_θ}[R]. The gradient is given as:

∇_θ V(θ) = E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^∞ R_t ∇_θ log π_θ(a_t | s_t) ]   (1)

The above gradient expression also extends to the multiagent case in an analogous manner (Peshkin et al. 2000; Foerster et al. 2018). In the multiagent setting, we can compute the gradient of the joint value function V w.r.t. an agent i's policy parameters θ_i, or ∇_{θ_i} V. The expectation is w.r.t. the joint state-action trajectories E_{s_{0:∞}, a_{0:∞}}, and R_t denotes the future return for the agent team. The input to the policy consists of some features of the agent's observation history, φ(τ_i). The function φ can be either hard-coded (e.g., only the last two observations) or learned using recurrent neural networks.

For using the knowledge compiled in the function feasibleActions, the only change we require is in the structure of an agent's policy π (we omit the superscript i for brevity). The main challenge is addressing the variable-sized output of the policy in a differentiable fashion. Assuming a deep neural net based policy π, given the discrete action space A, the last layer of the policy has |A| outputs using the softmax layer (to normalize action probabilities π(a|·)). However, when using feasibleActions, the probability of actions not in feasibleActions needs to be zero, and the set feasibleActions changes as the observation history τ of the agent is updated. Therefore, a fixed-size output layer appears to create difficulties. However, we propose an easy fix. We use π̃ to denote the standard way the policy π is constructed, with the last layer having fixed |A| outputs. However, we do not require the last layer to be a softmax layer. Instead, we re-define the policy π as:

π(a | τ) = 0, if a ∉ feasibleActions(p(τ); s, d);
π(a | τ) = exp(π̃(a | φ(τ))) / Σ_{a′ ∈ feasibleActions(p(τ); s, d)} exp(π̃(a′ | φ(τ))), otherwise.   (2)

where p(τ) denotes the path taken by the agent so far, and s, d are its source and destination. Sampling from π guarantees that invalid actions are not sampled. Furthermore, π is differentiable even when feasibleActions gives different-length outputs at different time steps. The above operation can be easily implemented in autodiff libraries such as Tensorflow without requiring a major change in the policy structure π.
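A minimal sketch of Eq. (2) in TensorFlow (assuming the policy net outputs the unnormalized scores π̃; the function name and tensor shapes are our own, not from the paper):

```python
import tensorflow as tf

def masked_policy(logits, feasible_mask):
    """Eq. (2): zero out infeasible actions by masking their logits with -inf
    before the softmax; gradients flow only through feasible entries, so the
    fixed-size output layer is preserved.
    logits: [batch, |A|] scores from pi-tilde; feasible_mask: [batch, |A|] bool."""
    neg_inf = tf.fill(tf.shape(logits), float("-inf"))
    masked_logits = tf.where(feasible_mask, logits, neg_inf)
    return tf.nn.softmax(masked_logits, axis=-1)

probs = masked_policy(tf.constant([[1.0, 2.0, 0.5]]),
                      tf.constant([[True, False, True]]))  # middle action gets probability 0
```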
Q-learning based MARL:
Deep Q-learning for the single-agent case (Volodymyr et al. 2015) has been extended to the multiagent setting as well (Rashid et al. 2018). In the QMIX approach (Rashid et al. 2018), the joint action-value function Q_tot(τ, a; ψ) is factorized as a (non-linear) combination of the action-value functions Q_i(τ_i, a_i; θ_i) of each agent i. A key operation when training the different parameters θ_i and ψ involves maximizing max_a Q_tot(τ, a; ψ) (for details we refer to Rashid et al.). This operation is intractable in general; however, under certain conditions, it can be approximated in QMIX by maximizing the individual Q functions, max_{a_i ∈ A} Q_i(τ_i, a_i). We require two simple changes to incorporate our knowledge compilation scheme in QMIX. First, instead of maximizing over all the actions, we maximize only over the feasible actions of an agent, as max_{a ∈ feasibleActions(p(τ_i); s_i, d_i)} Q_i(τ_i, a). Second, in Q-learning, typically a replay buffer is also used, which stores samples from the environment as (τ, a, τ′, r). In our case, we additionally store the set of feasible actions for the next observation history τ′_i of each agent i, feasibleActions(p(τ′_i); s_i, d_i), along with the tuple (τ, a, τ′, r). The reason is that when this tuple is replayed, we have to maximize Q_i(τ′_i, a) over a ∈ feasibleActions(p(τ′_i); s_i, d_i), and storing the set feasibleActions(p(τ′_i); s_i, d_i) reduces computation.
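The two QMIX changes can be sketched as follows (the buffer layout is our own; the point is caching feasibleActions(p(τ′_i); s_i, d_i) so the masked max does not re-query the psdd on replay):

```python
import random
from collections import deque

class MaskedReplayBuffer:
    """Stores (tau, a, r, tau', feasible') so the max over feasible actions
    can be computed cheaply when the transition is replayed."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, tau, action, reward, tau_next, feasible_next):
        # feasible_next: action indices from feasibleActions(p(tau'); s, d)
        self.buf.append((tau, action, reward, tau_next, feasible_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def masked_max_q(q_values, feasible_next):
    """Max over a in feasibleActions(p(tau'); s, d) instead of all of A."""
    return max(q_values[a] for a in feasible_next)
```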
We have integrated our knowledge compilation framework with two policy gradient approaches proposed in (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020) (one using a feedforward neural net, the other a recurrent neural network based policy), and a QMIX variant (Fu et al. 2019) for MAPF, demonstrating the generalization power of the framework for a range of MARL solution methods.

We now present our decision diagram based approach to implement the feasibleActions function. Let upper case letters (X) denote variables and lower case letters (x) denote their instantiations. Bold upper case letters (X) denote sets of variables, and their lower case counterparts (x) denote instantiations.

Paths as a Boolean formula:
A path p from a given source s to the destination d in the underlying undirected graph G = (V, E) can be represented as a Boolean formula as follows. Consider Boolean random variables X_{i,j} for each edge (i, j) ∈ E. If an edge (i, j) occurs in p, then X_{i,j} is set to true; otherwise, it is set to false. Hence, the conjunction of these literals denotes path p, and the Boolean formula representing all paths is obtained by simply disjoining the formulas for all such paths (Choi, Tavabi, and Darwiche 2016). An example path in a graph is given in fig 3(a).
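For intuition, a brute-force version of this encoding on a small graph (the compiled sdd introduced next replaces this exhaustive enumeration):

```python
def simple_path_models(adj, s, d):
    """Enumerate simple s-d paths; each path is a full assignment to the
    edge variables X_{i,j} (True iff the edge is on the path). The Boolean
    formula for all paths is the disjunction of these assignments."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    models = []

    def dfs(u, visited, used):
        if u == d:
            models.append({e: (e in used) for e in edges})
            return
        for v in adj[u]:
            if v not in visited:
                dfs(v, visited | {v}, used | {tuple(sorted((u, v)))})

    dfs(s, {s}, frozenset())
    return models

# 4-node toy graph: two simple paths from 1 to 4
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(len(simple_path_models(adj, 1, 4)))  # 2
```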
Sentential decision diagrams:
Since the number of paths between two nodes can be exponential, we need a compact representation of the Boolean formula representing paths. To this end, we use the sentential decision diagram, or sdd (Darwiche 2011). It is a Boolean function f(X, Y) on some non-overlapping variable sets X, Y and is written as a decomposition in terms of functions on X and Y. In particular, f = (p_1(X) ∧ s_1(Y)) ∨ … ∨ (p_n(X) ∧ s_n(Y)), with each element (p_i, s_i) of the decomposition composed of a prime p_i and a sub s_i. An sdd, represented as a decision diagram, describes members of a combinatorial space (e.g., paths in a graph) using propositional logic in a tractable manner. It has two kinds of nodes:
- a terminal node, which can be a literal (X or ¬X), always true (⊤), or always false (⊥), and
- a decision node, which is represented as (p_1 ∧ s_1) ∨ … ∨ (p_n ∧ s_n), where all (p_i, s_i) pairs are recursively sdds, and the primes are always consistent, mutually exclusive, and exhaustive.

Figure 3(b) represents an sdd for the graph in fig 3(a) encoding all paths from n1 to n5. The encircled node is a decision node with two elements (D, E) and (¬D, ⊥). The primes are D and ¬D, and the subs are E and ⊥. The Boolean formula representing this sdd node is (D ∧ E) ∨ (¬D ∧ ⊥), which is equivalent to D ∧ E. The Boolean formula encoded by the whole sdd is given by the root node of the sdd.

An sdd is characterized by a full binary tree, called a vtree, which induces a total order on the variables from a left-right traversal of the vtree. E.g., for the vtree in figure 3(c), the variable order is (A, B, C, D, E). Given a fixed vtree, the sdd is unique. An sdd node n is normalized for (or associated with) a vtree node v as follows:
- If n is a terminal node, then v is a leaf vtree node which contains the variable of n (if any).
- If n is a decision node, then n's primes (subs) are normalized for the left (right) child of v.
- If n is the root node, then v is the root vtree node.

Intuitively, a decision node n being normalized for vtree node v implies that the Boolean formula encoded by n contains only those variables contained in the sub-tree rooted at v. We will use this normalization property for our analysis later. The Boolean formula encoding the domain knowledge can be compiled into a decision diagram using the sdd compiler (Oztok and Darwiche 2015). The resulting sdd need not be exponential in size even though it represents an exponential number of objects.

Probabilistic Sentential decision diagrams:
In our case, for computing feasibleActions, we also need to associate a probability distribution with the sdd that encodes all the paths from a given source to destination. The key benefit is that we can exploit associated inference methods, such as computing conditional probabilities, which help in computing feasibleActions.

If we parameterize each of the decision nodes of the sdd such that the local parameters form a distribution, the resulting probabilistic structure is called a psdd, or probabilistic sdd (Kisa et al. 2014). It can be used to represent discrete probability distributions Pr(X) where several instantiations x have zero probability Pr(x) = 0 because of the constraints imposed on the space. More concretely, a psdd normalized for an sdd is defined as follows:
- For each decision node (p_1, s_1), …, (p_n, s_n), there is a positive parameter θ_i such that Σ_{i=1}^n θ_i = 1 and θ_i = 0 iff s_i = ⊥.
- For each terminal node ⊤, there is a parameter 0 < θ < 1.

psdds are tractable models of probability distributions, as several probabilistic queries can be performed in poly-time, such as computing marginal probabilities or conditional probabilities.

Figure 3: (a) A simple path in a graph from s = n1 to d = n5 is highlighted in red and can be written as the propositional sentence A ∧ C ∧ E ∧ ¬B ∧ ¬D; (b) an sdd for the graph in (a), where the encircled node represents a decision node (p_1, s_1), (p_2, s_2); (c) a right-linear vtree for the sdd; (d) a psdd with parameters annotated on decision nodes.

NZ (Non-Zero) Inference for feasibleActions: Given an sdd encoding all simple paths from a source s to a destination d, we uniformly parameterize this sdd as noted earlier. That is, for a decision node (p_1, s_1), …, (p_n, s_n), each θ_i is the same (except when s_i = ⊥, in which case θ_i = 0), and we also enforce that the non-zero θ_i's normalize to 1. This strategy makes sure that the probability of each simple path from s to d is non-zero. Assume that the current sampled path by the agent is p (in the context of the psdd, we assume that p is the set of edges in graph G traversed from source s by the agent). Let v_p denote the current vertex of the agent (and assume v_p is not the destination). Let Nb(v_p) denote all direct neighbors of v_p. The feasibleActions set is given as:

feasibleActions(p) = { v′ ∈ Nb(v_p) : (v_p, v′) ∉ p ∧ Pr((v_p, v′) | p) > 0 }   (3)
That is, if the conditional probability Pr((v_p, v′) | p) = 0, then v′ can be pruned from the action set, as it implies there is no simple path to destination d that takes the edge (v_p, v′) after taking the path p. This strategy seems straightforward to implement, as a psdd is equipped with inference methods to compute conditional probabilities. However, in RL, this inference needs to be done at each time step of each training episode. We observed empirically that this method was extremely slow, and it was impractical to scale it to multiple agents. We therefore next develop our customized inference technique, which is much faster than this generic inference.
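For concreteness, the straightforward implementation just described looks as follows; psdd.conditional_prob is a hypothetical generic-inference call, and its per-step cost is what motivates the customized method developed next:

```python
def feasible_actions(psdd, edge_var, neighbors, path_edges, v_p):
    """Eq. (3): keep neighbor v' iff the edge (v_p, v') is unused and
    Pr(X_{(v_p, v')} = 1 | path so far) > 0 under the psdd."""
    evidence = {edge_var[e]: True for e in path_edges}
    feasible = []
    for v_next in neighbors[v_p]:
        e = tuple(sorted((v_p, v_next)))
        if e in path_edges:
            continue  # a simple path never repeats an edge
        # hypothetical generic-inference query; too slow to run at every
        # RL time step in practice
        if psdd.conditional_prob(edge_var[e], evidence) > 0.0:
            feasible.append(v_next)
    return feasible
```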
Sub-context connectivity analysis for NZ Inference:
We note that all the discussion below is for a psdd that encodes all simple paths from a source s to destination d, where the psdd is normalized for some right-linear vtree. Proofs for the different results are provided in the supplementary material in the full paper available on arXiv.
Lemma 1. In a psdd normalized for a right-linear vtree, each prime is a literal (X or ¬X) or ⊤.

The above result is a direct consequence of the manner in which the underlying sdd is constructed using a right-linear vtree.

We sample a path from such a psdd by traversing it in a top-down fashion, selecting one branch at a time for each of the decision nodes according to the probability of that branch, and then selecting the prime and recursively going down the sub (Kisa et al. 2014). As all the prime nodes are terminal as per Lemma 1, if the prime node is a positive literal X, then we select the edge corresponding to X for our path (say e_X). If the prime node is ¬X, then we do not select edge e_X. We show in the supplement that the prime nodes encountered during such a sampling procedure for a psdd that encodes simple paths cannot be ⊤.

As an example, consider the graph in fig 3(a) and its corresponding psdd in fig 3(d). We start at the root of the psdd and select the left branch with probability 1. We then select the prime A in our sample and recursively go down its sub, as shown by the red arrows. The final sampled path is A−C−E, and the corresponding Boolean formula is A ∧ ¬B ∧ C ∧ ¬D ∧ E.

Definition 1. (S-Path) Let n be a psdd node normalized either for ṽ_l or ṽ_r, the two deepest vtree nodes. Let (p_1, s_1), …, (p_k, s_k) be the elements appearing on some path from the psdd root to node n (i.e., n = p_k or n = s_k). Then p_1 ∧ … ∧ p_k ∧ n is called an s-path for node n, and is feasible iff s_i ≠ ⊥ for all i.

In figure 3(c), ṽ_l is D and ṽ_r is E. There can be multiple s-paths for a node n. Let spset denote the set of all feasible s-paths for all psdd nodes n normalized either for ṽ_l or ṽ_r.
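A sketch of this top-down sampling over a right-linear-vtree psdd (the Node layout here is our own simplification; literals are signed integers, with positive ids naming edges):

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    # elements: (prime literal, sub node or None for a terminal, parameter theta)
    elements: List[Tuple[int, Optional["Node"], float]]

def sample_simple_path(root: Node, rng=random) -> set:
    """Pick one element per decision node by its theta, keep the edge of
    every positive prime literal, and recurse down the chosen sub."""
    edges, node = set(), root
    while node is not None:
        lits, subs, thetas = zip(*node.elements)
        i = rng.choices(range(len(lits)), weights=thetas)[0]
        if lits[i] > 0:           # prime X selected: include edge e_X
            edges.add(lits[i])
        node = subs[i]
    return edges

# toy psdd with two paths: {1, 3} w.p. 0.7 and {2} w.p. 0.3 (hypothetical)
leaf = Node([(3, None, 1.0)])
root = Node([(1, leaf, 0.7), (2, None, 0.3)])
print(sample_simple_path(root))
```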
Lemma 2. There is a one-to-one mapping between the s-paths in the set spset and the set of all simple paths in G from source s to destination d.

The above lemma states that if we find a feasible s-path in the psdd, then it corresponds to a valid simple path from source s to destination d in the graph G, which will also have nonzero probability as per our psdd. Reading off the path in G given a feasible s-path is straightforward. A feasible s-path is also a conjunction of literals (using Lemma 1, and if n is a sub, it will also be a literal, as n is normalized for the deepest node in the vtree). For each positive literal X in the s-path, we include its corresponding edge e_X in the path in G. The set of resulting edges forms a simple path in G.

This result also provides a strategy for our fast NZ inference. Given a path p in graph G, our goal is to find whether Pr((v_p, v′) | p) > 0. If we can prove that there exists an s-path sp ∈ spset such that its corresponding path in graph G (using Lemma 2) contains all the edges in p and (v_p, v′), then Pr((v_p, v′) | p) must be nonzero. We need a few additional results below to turn this insight into an efficient algorithmic procedure.

Definition 2. (Sub-context (Kisa et al. 2014)) Let (p_1, s_1), …, (p_k, s_k) be the elements appearing on some path from the sdd root to node n (i.e., n = p_k or n = s_k). Then p_1 ∧ … ∧ p_k is called a sub-context sc for node n, and is feasible iff s_i ≠ ⊥ for all i.

Notice that a psdd node n can have multiple (feasible) sub-contexts, as a psdd is a directed acyclic graph (DAG). Essentially, each sub-context corresponds to one possible way of reaching node n from the psdd root. For a right-linear vtree, a feasible sub-context is a conjunction of literals, as all primes are literals (Lemma 1).

Given two psdd nodes n and n′, we say that n′ is deeper than n if the vtree node v′ for which n′ is normalized is deeper than the vtree node v for which n is normalized.

Definition 3. (Sub-context set) Let X be a positive literal, and let p_{i_1}, …, p_{i_k} be psdd prime nodes such that each p_{i_j} = X. Let ssc_1, …, ssc_k be sets such that each ssc_j contains all the feasible sub-contexts of p_{i_j}. Then the sub-context set of X, denoted by sset(X), is defined as sset(X) = ∪_{j=1}^k ssc_j.

We now show the procedure to perform sub-context connectivity analysis for NZ inference. Assume that the current sampled path in graph G is p = {e_1, …, e_k} (each e_i is an edge traversed by the agent so far). Let the current vertex of the agent be v_p. Let e = (v_p, v′) be one possible edge in G that the agent can traverse next. Let X_{e_1}, …, X_{e_k}, X_e be the respective Boolean variables for the different edges. We wish to determine whether P(X_e | X_{e_1}, …, X_{e_k}) is greater than zero (shorthand for P(X_e = 1 | X_{e_1} = 1, …, X_{e_k} = 1)). We follow the following steps to determine this.

1. Find the variable X̃ ∈ {X_{e_1}, …, X_{e_k}, X_e} that is deepest in the vtree order.
2. Check if there exists a sub-context sc ∈ sset(X̃) such that sc contains all the positive literals {X_{e_1}, …, X_{e_k}, X_e}. Concretely, check if ∃ sc ∈ sset(X̃) s.t. sc ∧ X = sc, ∀X ∈ {X_{e_1}, …, X_{e_k}, X_e}. Denote this sub-context sc* (if it exists).

3. Since sc* is the sub-context of the variable deepest in the vtree order among {X_{e_1}, …, X_{e_k}, X_e}, it can be extended to a feasible s-path sp ∈ spset such that sp contains sc* (or sc* ∧ sp = sp); this is proved formally in the supplementary. Therefore, we have shown the existence of a feasible s-path sp that contains all the literals {X_{e_1}, …, X_{e_k}, X_e}, and by Lemma 2, there also exists a simple path in graph G that contains the edges {e_1, …, e_k, e}. Therefore, P(X_e | X_{e_1}, …, X_{e_k}) is non-zero.

4. If sc* does not exist, then a feasible s-path cannot be found containing all the literals {X_{e_1}, …, X_{e_k}, X_e} (proved in the supplementary). Therefore, P(X_e | X_{e_1}, …, X_{e_k}) is zero.

Step 2 in the method above is computationally the most challenging. We develop additional results in the supplementary material that further optimize this step, resulting in a fast and practical algorithm for NZ inference; a sketch of the procedure appears below.
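Putting the four steps together, assuming sset is precomputed offline and each sub-context is stored as a set of positive literals (variables ordered by vtree depth are an assumption of this sketch):

```python
def nz_inference(sset, depth, path_vars, candidate_var):
    """Return True iff P(X_e | X_{e_1}, ..., X_{e_k}) > 0.
    sset: variable -> iterable of feasible sub-contexts (sets of literals);
    depth: variable -> its depth in the vtree order."""
    literals = set(path_vars) | {candidate_var}
    deepest = max(literals, key=lambda x: depth[x])   # step 1
    for sc in sset.get(deepest, ()):                  # step 2
        if literals <= sc:
            return True                               # step 3: sc* extends to a feasible s-path
    return False                                      # step 4: no sc* => probability is zero
```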
Hierarchical clustering for large graphs:
For increasing the scalability of the psdd framework and NZ inference on large graphs, we take motivation from (Choi, Shen, and Darwiche 2017b; Shen et al. 2019). These previous results show that by suitably partitioning the graph G among clusters, we can keep the size of the psdd tractable even for very large graphs. Such partitioning does result in a loss of expressiveness, as the psdd for the partitioned graph may omit some simple paths; empirically, however, we found that this partitioning scheme still improved the efficiency of the underlying RL algorithms significantly. This partitioning method is described in the supplementary material in more detail.

The framework that we presented can be used to compile a number of different kinds of constraints. For example, the agent may have to first go to a pickup location and then to a delivery location (Liu et al. 2019), or satisfy TSP-like constraints where the agent has to visit some locations before reaching the destination while avoiding collisions. An example is explained below in more detail.
Landmark Constraints:
This framework can be extended to settings where an agent is required to visit some landmarks before reaching the destination. We can construct the Boolean formula representing such a constraint by taking the incident edge variables for each of the landmarks and requiring at least one of them to be true. We can then multiply (Shen, Choi, and Darwiche 2016b) the psdd representing such a formula with the psdd representing the simple-path constraint. For example, if n_i is a node representing a landmark and A, B, C, D are Boolean variables representing the edges incident on n_i, then we can represent the constraint for n_i as β_i = A ∨ B ∨ C ∨ D. For k such landmarks, we can similarly represent the constraints β_1, …, β_k. Then the Boolean formula for all the landmarks would be β = ∧_{i=1}^k β_i, and it can be compiled as a psdd. Now, if α is a psdd representing simple paths between a source and a destination, then we can multiply α and β to get the final psdd representing simple paths where the agent is required to visit the landmarks before the destination. This strategy can be scaled up by hierarchical partitioning of the graph (Choi, Shen, and Darwiche 2017b; Shen et al. 2019) and can be used to represent complex constraints by multiplying them. This process is also modular, since the constraints are modeled separately from the underlying graph connectivity.
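The Boolean side of this construction is easy to sketch; compiling β and multiplying it with α is done via the psdd multiply operator of Shen, Choi, and Darwiche (2016b) and is not reproduced here (the clause representation below is our own):

```python
def landmark_formula(landmarks, incident_edges):
    """beta = AND over landmarks n_i of (OR of the edge variables incident on
    n_i), represented as a list of clauses (sets of positive edge literals)."""
    return [set(incident_edges[n]) for n in landmarks]

def visits_all_landmarks(path_edges, beta):
    """A path (set of true edge variables) satisfies beta iff it touches
    at least one incident edge of every landmark."""
    return all(clause & path_edges for clause in beta)

# hypothetical landmark n_i with incident edges A, B, C, D
beta = landmark_formula(["n_i"], {"n_i": {"A", "B", "C", "D"}})
print(visits_all_landmarks({"A", "E"}, beta))  # True: edge A covers n_i
```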
Furthermore, this framework can also be used in cases where the underlying graph connectivity is dynamic, e.g., in scenarios where edges get blocked over time or the graph is revealed with time, as in the Canadian Traveller Problem (Liao and Huang 2014). Any observation about blocked edges at a time step can become evidence, and by conditioning on this evidence, the agent can rule out routes via such blocked edges. The generalizability and flexibility of this framework make it a promising approach for combining domain knowledge with models for RL, pathfinding, and other areas.

We present results to show how the integration of our framework with previous multiagent deep-RL approaches based on policy gradient and Q-learning (Sartoretti et al. 2019; Ling, Gupta, and Kumar 2020) performs better on MAPF problems in terms of both sample efficiency and solution quality on a number of different maps with varying numbers of agents.

Simulation Speed:
We show comparisons between our method and the generic psdd inference method for calculating marginal probabilities. Our approach is more than an order of magnitude faster.
Approach         No Clustering   Clustering
Ours             26.55           158.41
psdd inference   979.71          402665.98

Table 1: Simulation speed comparison (in seconds)
Open Grid Maps:
We next evaluate the integration of our knowledge-based framework with policy gradient and Q-learning based approaches. We combine our framework with DCRL (Ling, Gupta, and Kumar 2020) and MAPQN (Fu et al. 2019) on several open grid maps with a varying number of agents. DCRL is a policy gradient based algorithm, and MAPQN is a Q-learning based algorithm. We follow the same MAPF model as (Ling, Gupta, and Kumar 2020), where each node has its own capacity (the maximum number of agents that can be accommodated), and agents can take multiple time steps to move between two contiguous nodes. The total objective is to minimize the sum of costs (SOC) of all agents combined with penalties for congestion. More details on the experiments, the neural network structure, and the hyperparameters are noted in the supplementary material.

The environment settings vary from a 4x4 grid with 2 agents up to a 10x10 grid with 30 agents. We generated 10 instances for each setting. In each setting, we follow (Ling, Gupta, and Kumar 2020) to randomly select the sources and destinations and to specify the capacity of each node. We also specify the min and max time (t_min, t_max) to move between two nodes. We run each instance three times, and we terminate the runs either after 500 iterations or 10 hours. Each episode has a maximum length of 500 steps. For each instance, we choose the run with the best performance. We compute the total objective averaged over all agents and the cumulative number of samples averaged over all agents during training. Finally, we plot the average total objective vs. the average cumulative sample count over all instances.

Figure 4 shows the results comparing DCRL with Knowledge Compilation (DCRL+KCO) and DCRL on 4x4, 8x8, and 10x10 grids (plots for 4x4 with 2 agents, 8x8 with 6 agents, and 10x10 with 10 agents are deferred to the supplementary). Although all agents are able to reach their respective destinations (no stranded agents) in both DCRL+KCO and DCRL, agents are trained to reach destinations cooperatively with significantly fewer samples in DCRL+KCO. This means that agents explore the environment more efficiently in DCRL+KCO than in DCRL, especially during the initial few training episodes. This is also reflected in the plot, as the average total objective in DCRL+KCO is significantly higher during the initial training phase compared to DCRL.

Figure 5 shows the comparison of sample efficiency between MAPQN+KCO and MAPQN on 4x4 and 8x8 grids. We did not evaluate MAPQN+KCO on the 10x10 grid, since MAPQN itself is not able to train a large number of agents on large grid maps (more details in (Ling, Gupta, and Kumar 2020)). We observe that MAPQN+KCO converges faster and to a better quality than MAPQN, especially on the 8x8 grid. This is because several agents did not reach their destinations within the episode cutoff in MAPQN; in contrast, all agents reach their destinations in MAPQN+KCO.

Obstacles:
We evaluate KCO with DCRL and MAPQN on a 10x10 obstacle map with a varying number of agents (from 2 agents up to 10 agents). The obstacles are randomly generated with density 0.35. We generate 10 instances for this setting. For each instance, sources and destinations are randomly generated from the non-blocked nodes in the top and bottom rows (each source and destination pair is guaranteed to be reachable). Other parameters are specified in the same way as in the above experiments. This set of experiments is quite challenging, especially when there are several agents, since they can easily go into dead ends while cooperating with each other to reduce the congestion level. Figure 6 clearly shows that DCRL and MAPQN converge much faster with the integration of KCO, and confirms that our approach is more sample efficient. Specifically for MAPQN, several agents did not reach their destinations (8.8 agents on average for the N10 case), whereas in MAPQN+KCO all agents reached their destinations, which explains the much better solution quality of MAPQN+KCO.

We also evaluate KCO integrated with the PRIMAL framework (Sartoretti et al. 2019), which is based on asynchronous advantage actor-critic, or A3C (Mnih et al. 2016), combined with imitation learning. We test it on a 10x10 map