On Algorithmic Decision Procedures in Emergency Response Systems in Smart and Connected Communities
Geoffrey Pettet, Ayan Mukhopadhyay, Mykel Kochenderfer, Yevgeniy Vorobeychik, Abhishek Dubey
Geoffrey Pettet, Vanderbilt University, Nashville, TN
Ayan Mukhopadhyay, Stanford University, Palo Alto, CA
Mykel Kochenderfer, Stanford University, Palo Alto, CA
Yevgeniy Vorobeychik, Washington University, St. Louis, MO
Abhishek Dubey, Vanderbilt University, Nashville, TN
ABSTRACT
Emergency Response Management (ERM) is a critical problem faced by communities across the globe. Despite this, it is common for ERM systems to follow myopic decision policies in the real world. Principled approaches to aid ERM decision-making under uncertainty have been explored but have failed to be accepted into real systems. We identify a key issue impeding their adoption: algorithmic approaches to emergency response focus on reactive, post-incident dispatching actions, i.e., optimally dispatching a responder after incidents occur. However, the critical nature of emergency response dictates that when an incident occurs, first responders always dispatch the closest available responder to the incident. We argue that the crucial period of planning for ERM systems is not post-incident, but between incidents. This is not a trivial planning problem; a major challenge with dynamically balancing the spatial distribution of responders is the complexity of the problem. An orthogonal problem in ERM systems is planning under limited communication, which is particularly important in disaster scenarios that affect communication networks. We address both problems by proposing two partially decentralized multi-agent planning algorithms that utilize heuristics and exploit the structure of the dispatch problem. We evaluate our proposed approach using real-world data, and find that in several contexts, dynamically re-balancing the spatial distribution of emergency responders reduces both the average response time as well as its variance.
KEYWORDS
Emergency Response Management; Monte Carlo Tree Search; Decentralized Algorithms; Smart and Connected Communities
ACM Reference Format:
Geoffrey Pettet, Ayan Mukhopadhyay, Mykel Kochenderfer, Yevgeniy Vorobeychik, and Abhishek Dubey. 2020. On Algorithmic Decision Procedures in Emergency Response Systems in Smart and Connected Communities. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9-13, 2020, IFAAMAS, 9 pages.
1 INTRODUCTION

Emergency response management (ERM) is a critical problem faced by communities across the globe. First responders must attend to many incidents dispersed across space and time using limited resources. ERM can be decomposed into the following sub-problems: forecasting, planning, and dispatching. Although these have been examined independently, planning and dispatch decisions are dependent on accurate incident forecasting. Therefore, it is imperative that principled approaches are designed to tackle all three sub-problems. However, it is fairly common for ERM systems to follow myopic and straightforward decision policies. For decades, the most common dispatching approach was to send the closest available responder to the incident (in time or space), after which the responder would return to its base or be reassigned. Such methods do not necessarily minimize expected response times [15]. As cities grow, population density, traffic dynamics, and the sheer frequency of incidents make such methods stale and inaccurate. We systematically investigate the nuances of algorithmic approaches to ERM and describe how principled decision-making can aid emergency response.

Naturally, algorithmic approaches to emergency response typically combine a data-driven forecasting model to predict incidents with a decision-making process that provides dispatch recommendations. Canonical approaches towards modeling the decision process involve using a Continuous-Time Markov Decision Process (CTMDP) [11] or a Semi-Markov Decision Process (SMDP) [16], which are solved through dynamic programming. While the SMDP model provides a more accurate representation of ERM dynamics, it does not scale well for dynamic urban environments [13].
The trade-off between optimality and computational time has also been investigated by the use of Monte-Carlo based methods [13]. Despite such algorithmic progress and attention in recent years from the AI community [4, 13, 16, 18, 19, 22], there are still issues that impede the adoption of principled algorithmic approaches. We argue that a major problem lies in the very focus of most algorithmic approaches. Most ERM systems seek to perform decision-making after incidents occur. While such approaches guarantee optimality in the long run (with respect to response times), they de-prioritize response to some incidents. Our conversations with first responders [6] revealed two crucial insights about this problem: (1) it is almost impossible to gauge the severity of an incident from a call for assistance and de-prioritize immediate response in anticipation of higher future rewards, and (2) Computer-Aided Dispatch (CAD) systems [21] typically enable a human agent to dispatch a responder in the span of 5-10 seconds. These insights explain why the closest responder is usually dispatched to an incident; it is too risky to de-prioritize incidents of unknown severity.

We raise an important conceptual question about algorithmic approaches to emergency response: is it feasible to optimize over dispatch decisions once an incident has happened? In this paper we argue that the crucial, practical period of principled decision-making is between incidents. This avoids the potential consequences of explicitly choosing to de-prioritize response to an incident to achieve future gain, but accommodates the scope of principled decision-making. Most ERM systems do not exploit the scope of dynamically rebalancing the spatial distribution of responders according to the need of the hour. This problem is challenging since optimizing responder distribution and response as a multi-objective optimization problem is usually computationally infeasible.
Indeed, even Monte-Carlo based methods have previously been used with a restricted action space (only responding to incidents) to achieve acceptable computational latency [13]. We address this challenge by proposing two efficient algorithmic approaches to optimize over the spatial distribution of responders dynamically.

The second set of problems that impedes the adoption of algorithmic decision-making in ERM is related to resilience and efficiency. Data processing and decision-making for algorithmic dispatching usually occur in a centralized manner (typically at a central data processing center), and decisions are then communicated to responders. ERM, however, clearly evolves in a multi-agent setting, in which the agents have the capacity to perform independent computation (most modern ambulances are equipped with laptops). In an extremely time-critical setting, especially during communication breakdowns often caused by disasters, it is crucial that such computing abilities are used, and that distributed and parallelized algorithmic frameworks are designed. Also, centralized decision-making systems treat all agents as part of a monolithic object or state. This is redundant, as agents often operate independently (for example, an ambulance in one part of the city is usually not affected by an incident in a completely different or distant part). In this paper, we argue that decentralized planning could identify and utilize structure in the problem and save vital computational time.

Contributions:
We focus on two problems in this paper: (1) designing an approach that can accommodate rebalancing of resources to ensure efficient response, and (2) designing the ability for an emergency response system to be equipped to deal with scenarios that require decentralized planning with very limited communication. To this end, we start by modeling the problem of optimal response as a Multi-Agent Semi-Markov Decision Process (M-SMDP) [2, 9]. Then, we describe a novel algorithmic approach based on Multi-Agent Monte-Carlo Tree Search (M-MCTS) [3] that facilitates parallelized planning to dynamically rebalance the spatial distribution of responders. Our approach utilizes the computation capacity of each individual agent to create a partially decentralized approach to planning. Finally, we evaluate our framework using real-world data from Nashville, TN. We find that these approaches maintain system fairness while decreasing the average and variance of incident response times when compared to the standard procedure.

Table 1: Notation lookup table

Symbol        Definition
Λ             Set of agents
D             Set of depots
C(d)          Capacity of depot d
G             Set of cells
S             State space
A             Action space
P             State transition function
T             Temporal transition distribution
α             Discount factor
ρ(s, a)       Reward function given action a taken in state s
A             Joint agent action space
T             Termination scheme
s_t           Particular state at time t
I_t           Set of cell indices waiting to be serviced
R_t           Set of agent states at time t
p_j^t         Position of agent j
g_j^t         Destination of agent j
u_j^t         Current status of agent j
s_i, s_j      Individual states
t_ij          Transition time between states s_i, s_j
t_a           Time between incidents
t_s           Time to service an incident
t_b           Time to a balance step
r             Reward
t_r           Incident response time (the time between incident awareness and the first agent's arrival on scene)
σ             Action recommendation set
µ             Mean agent service time
c_d           Number of agents at depot d
υ_g           Incident rate at cell g
υ_dg          The fraction of cell g's incident rate shared by depot d
ϒ             Set of occupied depots and their split incident rates
π_ϒ           Utility of ϒ
t_h           Time since beginning of planning horizon
t_r(s, a)     Response time to an incident given action a in state s
ϕ_k(s, a)     Distance traveled by agent k while balancing
ψ             Exogenous parameter balancing response time and distance traveled
RoI           Radius of Influence of a cell (used in the queue-based rebalancing policy)
Outline:
Through the rest of the paper, we describe the overall problem of emergency response and explain the algorithmic framework. We begin by providing a brief background regarding how the ERM pipeline can be modeled technically, and how theoretical solution approaches work in such situations. Then, we describe our algorithmic framework in detail, and finally, evaluate our framework using incident and response data from Nashville, TN. Table 1 can be used as a reference for the symbols we use.
2 PROBLEM FORMULATION

Our goal is to develop an approach for emergency responder placement and incident response in a dynamic, continuous-time and stochastic environment. We begin with several assumptions on the problem structure and information provided a priori. First, we assume that we are given a spatial map broken up into a finite collection of equally-sized cells G, and that we are given an exogenous spatial-temporal model of incident arrival in continuous time over this collection of cells (we describe one such model later). Second, we assume that for each spatial cell, the temporal distribution of incidents is homogeneous. Our third assumption is that emergency responders are allowed to be housed in a fixed and exogenously specified collection of depots D. Depots are essentially a subset of cells that responders can wait in, and are analogous to fire stations in the real world. Each depot d ∈ D has a fixed capacity C(d) of responders it can accommodate at a time. We assume that when an incident happens, a free responder (if available) is dispatched to the site of the incident. Once dispatched, the time to service consists of two parts: (1) time taken to travel to the scene of the incident, and (2) time taken to attend to the incident. If no free responders are available, then the incident enters a waiting queue.

2.1 Incident Forecasting

An important component of a decision-theoretic framework to aid emergency response is the understanding of when and where incidents occur. While our algorithmic framework can work with any forecasting model, we briefly describe the one that we choose to use: a continuous-time forecasting model based on survival analysis. It has recently shown state-of-the-art prediction performance for a variety of spatial-temporal incidents (crimes, traffic accidents, etc.) [15, 17, 18].
Formally, the model represents a probability distribution over inter-arrival times between incidents, conditional on a set of features, and can be represented as f(T = t | γ(w)), where f is a probability distribution for a continuous random variable T representing the inter-arrival time, which typically depends on covariates w via the function γ. The model parameters can be estimated by the principled procedure of Maximum Likelihood Estimation (MLE) [10].

2.2 Decision Process

The evolution of incident arrival and emergency response occurs in continuous time, and can be cohesively represented as a Semi-Markov Decision Process (SMDP) [16]. An SMDP can be described by the tuple (S, A, P, T, ρ(i, a), α), where S is a finite state space; A is the set of actions; P is the state transition function, with p_ij(a) being the probability with which the process transitions from state i to state j when action a is taken; T denotes the temporal transition function, with t(i, j, a) representing a distribution over the time spent during the transition from state i to state j under action a; ρ represents the reward function; and α is the discount factor.

To adapt this formulation to a multi-agent setting, we model the evolution of incidents and responders together in a Multi-Agent SMDP (MSMDP) [20], which can be represented as the tuple (Λ, S, A, P, T, ρ(i, a), α, T), where Λ is a finite collection of agents and λ_j ∈ Λ denotes the j-th agent. The action space of the j-th agent is represented by A_j, and A = ∏_{j=1}^{m} A_j represents the joint action space. We assume that the agents are cooperative and work to maximize the overall utility of the system. The components S, ρ and P are defined as in a standard SMDP. T represents a termination scheme; note that since agents each take different actions that could take different times to complete, they may not all terminate at the same time. An overview of such schemes can be found in prior literature [20].
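As a concrete illustration of the forecasting component, the sketch below fits the simplest parametric survival model, an exponential inter-arrival distribution, by maximum likelihood; the covariate link γ(w) is omitted for brevity, and all names are our own, not the authors' implementation.

```python
import math
import random

def fit_exponential_mle(inter_arrival_times):
    """Closed-form MLE for an exponential inter-arrival model
    f(t) = lam * exp(-lam * t): the estimate is lam_hat = n / sum(t_i),
    i.e., the reciprocal of the sample mean."""
    n = len(inter_arrival_times)
    rate = n / sum(inter_arrival_times)
    # Log-likelihood at the optimum, useful for comparing models.
    log_lik = n * math.log(rate) - rate * sum(inter_arrival_times)
    return rate, log_lik

def sample_inter_arrival(rate, rng):
    """Inverse-transform sample of the next inter-arrival time T."""
    return -math.log(1.0 - rng.random()) / rate

rate, ll = fit_exponential_mle([1.5, 2.0, 2.5])  # sample mean 2.0 -> rate 0.5
t_next = sample_inter_arrival(rate, random.Random(7))
```

The richer survival models cited in the text condition the rate on covariates w; this sketch only shows the MLE machinery they share.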
We focus on asynchronous termination, where actions for a particular agent are chosen as and when the agent completes its last assigned action. Next, we define the important components of the decision process in detail.

States: A state at time t is represented by s_t, which consists of a tuple (I_t, R_t), where I_t is a collection of cell indices that are waiting to be serviced, ordered according to the relative times of incident occurrence. R_t corresponds to information about the set of agents at time t, with |R_t| = |Λ|. Each entry r_j^t ∈ R_t is a set {p_j^t, g_j^t, u_j^t}, where p_j^t is the position of responder λ_j, g_j^t is the destination cell that it is traveling to (which can be its current position), and u_j^t is used to encode its current status (busy or available), all observed at the state of our world at time t. For the sake of convenience, we abuse notation slightly and refer to an arbitrary state simply by s, and use the notation s_i and s_j to refer to multiple states. We point out that our model revolves around states with specific events that provide the scope of decision-making. Specifically, decisions need to be taken when incidents occur, when responders finish servicing, and while rebalancing the distribution of responders. We also make the assumption that no two events can occur simultaneously in our world. In case such a scenario arises, since the world evolves in continuous time, we can add an arbitrarily small time interval to segregate the two events and create two separate states.

Actions:
Actions in our world correspond to directing the responders to a valid cell to either respond to an incident or wait. Valid locations include cells with pending incidents or any depot that has capacity to accommodate additional responders. For a specific agent λ_i, valid actions for a specific state s_i are denoted by A_i(s_i) (some actions are naturally invalid; for example, if an agent is at cell k in the current state, any action not originating from cell k is unavailable to the agent). Actions can be broadly divided into two categories: responding and rebalancing. Responding actions refer to an agent actually going to the scene of an incident to service it. But agents could also be directed to wait at certain depots based on the likelihood of future incidents in the proximity of the said depot. We refer to such actions as rebalancing. Finally, we reiterate that the joint valid action space of all the agents and a particular instantiation of it are defined by A and a respectively, and those of a specific agent λ_j by A_j and a_j.

Transitions:
Having described the evolution of our world, we now look at both the transition time between states and the probability of observing a state given the last state and action taken. We define the former first, denoting the time between two states s_i and s_j by the random variable t_ij. There are four random variables of interest in this context. We denote the time between incidents by the random variable t_a, the time to service an incident by t_s, the time taken for a balancing step by t_b, and the time taken for a responder to reach the scene of an incident by t_r. We overload these notations for convenience later. Specifically, we model t_a using the survival model described in section 2.1. We model the service times (t_s) by learning an exponential distribution from service times in historical emergency response data, and we model the rebalancing time (t_b) simply by the time taken by an agent to move to the directed cell. We refrain from focusing on the transition function P, as our algorithmic framework only needs a generative model of the world and not explicit estimates of state transition probabilities.

Rewards:
Rewards in an SMDP usually have two components: a lump-sum instantaneous reward for taking actions, and a continuous-time reward as the process evolves. Our system only involves the former, which we denote by ρ(s, a) for taking action a in state s. We define the exact reward function in section 3.3.

2.3 Problem Definition

Given state s and an agent set Λ, the problem is to determine an action recommendation set σ = {a_1, ..., a_m}, s.t. a_i ∈ A_i(s), that maximizes the expected reward. The i-th entry in σ contains a valid action for the i-th agent.

Solving this problem directly is hard due to its intractable state space. Further, the state transition functions are unknown and difficult to model in closed form, which is typical of urban scenarios where incidents and responders are modeled cohesively [16]. Finally, we have to consider the following practical constraints and limitations.
• Temporal constraints: emergency response systems can afford only minimal latency (5-10 seconds in practice).
• Capacity constraints: each depot has a fixed agent capacity.
• Uniform severity constraint: all incidents must be responded to 'promptly', without making a judgement about severity based on a report or a call.
• Wear and tear: the overall distance agents travel should be controlled to limit vehicle wear and tear.
• Limited communication: ERM systems must be equipped to deal with disaster situations, where communication is limited.
The temporal and uniform severity constraints make it difficult to justify implementing dispatch policies other than greedy; in order to improve upon greedy dispatch, some 'good' myopic rewards must be sacrificed for an increase in expected future rewards.
Since it is very hard to predict the severity of an incident pre-dispatch, the decision process cannot determine if this sacrifice is acceptable. Therefore, in this work we focus on inter-incident planning while maintaining greedy dispatch decisions when an incident is reported. This approach gives the decision-maker more flexibility, as it can proactively position resources rather than reacting to incidents. Our problem then becomes how to distribute responders between incidents such that the greedy dispatching rewards are maximized.
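The fixed dispatch rule kept throughout this work can be sketched as follows; the names are illustrative (not the authors' code), and `dist` is assumed to return the travel distance between two cells.

```python
def greedy_dispatch(incident_cell, agents, dist):
    """Send the closest available responder to the incident.
    agents: list of dicts with 'position' (cell index) and 'busy' (bool).
    Returns the chosen agent, or None if the incident must wait in queue."""
    available = [a for a in agents if not a["busy"]]
    if not available:
        return None  # no free responder: incident enters the waiting queue
    return min(available, key=lambda a: dist(a["position"], incident_cell))
```

Because this rule is held fixed, the planning problem reduces to choosing where responders wait between incidents, which is exactly what the rebalancing approaches below optimize.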
3.1 The Complexity of Rebalancing

Dynamic rebalancing's flexibility comes with an increase in complexity. Consider an example city with |Λ| responders (i.e., agents) and |D| locations where responders could be stationed (called depots) that can each hold one responder. When making a dispatch decision at the time of an incident, a decision maker has at most |Λ| possible choices: which responder to dispatch. If instead it is re-assigning responders across depots, there are significantly more choices. For example, with |Λ| = 20 and |D| = 30, there are 20 dispatching choices per incident, but P(|D|, |Λ|) = |D|! / (|D| − |Λ|)! ≈ 7.31 × 10^25 possible assignments. This number only increases if depots have higher responder capacities.

Approaching the problem from this perspective requires solutions that can cope with this large complexity. One possible approach is to directly solve the SMDP model. Although the state transition probabilities are unknown, one can estimate the transition function by embedding learning into policy iteration [16]. This approach is unsuitable for rebalancing, as it is too slow even for the dispatch problem. A centralized MCTS approach suffers from the same shortcoming, and barely satisfies the computational latency constraints even for the dispatch problem [14]. Instead, we seek to exploit meaningful heuristics to propose computationally feasible rebalancing strategies. We begin by presenting our first approach, which focuses on using historical frequencies of incident occurrence across cells to assign responders.

3.2 Queue-Based Rebalancing

One way to address the complexity of rebalancing is by considering an informed heuristic. A natural heuristic for ERM rebalancing is the incident rate: each depot can be assigned responders based on the total rate of incidents it serves. Ultimately, our goal is to find a rebalancing strategy that minimizes expected response times. As a result, we first estimate the response time given a specific assignment of responders. Such a scenario can be modeled as a multi-server M/M/c queue [8]. For a given cell and depot, the expected response time for an M/M/c queue can be represented as

responseTime(c_d, υ, µ) = ω(c_d, υ/µ) / (c_d µ − υ) + 1/µ,
where ω(c_d, υ/µ) = 1 / [ 1 + (1 − q) (c_d! / (c_d q)^{c_d}) Σ_{k=0}^{c_d − 1} (c_d q)^k / k! ]   (1)

where µ is the service rate of responders (1/µ = E(t_s) is the mean service time), c_d is the number of responders stationed at the depot, υ denotes the rate of incident occurrence at the concerned cell, and q = υ/(c_d µ) is the server utilization.
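Equation (1) can be transcribed directly. The sketch below assumes, as in the reconstruction above, that µ denotes the service rate (so 1/µ is the mean service time); ω is the standard Erlang C waiting probability, and the function names are our own.

```python
import math

def erlang_c(c, a):
    """Erlang C formula ω: probability an arrival must wait in an M/M/c
    queue, for c servers and offered load a = υ/µ (requires a < c)."""
    rho = a / c  # server utilization q
    s = sum((a ** k) / math.factorial(k) for k in range(c))
    return 1.0 / (1.0 + (1.0 - rho) * math.factorial(c) / (a ** c) * s)

def response_time(c, arrival_rate, service_rate):
    """Equation (1): expected queueing delay plus one mean service time."""
    a = arrival_rate / service_rate
    if a >= c:
        return math.inf  # queue is unstable; no finite expected delay
    return erlang_c(c, a) / (c * service_rate - arrival_rate) + 1.0 / service_rate
```

As a sanity check, with c = 1 this collapses to the M/M/1 sojourn time 1/(µ − υ).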
The standard M/M/c model above needs slight adjustment to account for the fact that incidents at a cell g can potentially be serviced by any depot, each located at a different distance from g. Therefore, we consider a multi-class queue formulation in which a cell's incident rate is split among the depots. Since depots closer to a cell g are more likely to service its incidents, we split g's incident arrival rate such that the fraction of the rate incurred by a depot is inversely proportional to its distance to g. The following system of linear equations can be used to split the arrival rate of a cell g among depots D:

Σ_{d ∈ D} υ_dg = υ_g   (2a)
dist(d̃, g) υ_d̃g = dist(d_i, g) υ_{d_i g}   ∀ d_i ∈ D \ {d̃}   (2b)

where the variable υ_dg is the fraction of the arrival rate of cell g that is shared by depot d, dist(d, g) denotes the distance between depot d and cell g, and d̃ is the depot closest to g. Equation (2a) ensures that the split rates for each cell g ∈ G sum to its actual arrival rate υ_g, and equation (2b) ensures that the split rates are inversely proportional to the relative distances between the depots and the cell. For convenience, we refer to the entire set of split rates by ϒ.

The split rates ϒ provide a foundation for a responder rebalancing approach, given a few considerations. First, we might not have enough responders to meet the total demand based on ϒ. Secondly, the problem of evaluating response times in the context of emergency response is different from the standard M/M/c queue formulation, since travel times are not memoryless and must be modeled explicitly.
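Because (2b) forces each depot's share to be inversely proportional to its distance, the system {2a, 2b} has a simple closed-form solution, sketched below with illustrative names.

```python
def split_rates(cell_rate, depot_distances):
    """Solve equations (2a)-(2b) in closed form: depot d's share of cell g's
    arrival rate υ_g is proportional to 1/dist(d, g), normalized to sum to υ_g.
    depot_distances: {depot_id: dist(d, g)} over the occupied depots."""
    inv = {d: 1.0 / dist for d, dist in depot_distances.items()}
    z = sum(inv.values())
    return {d: cell_rate * w / z for d, w in inv.items()}
```

By construction the shares satisfy (2a) exactly, and dist(d, g) * υ_dg is the same constant for every depot, which is (2b).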
To address these issues, we design a scoring mechanism for evaluating a specific allocation of responders to depots for a given ϒ. We denote this score by π_ϒ. Using ϒ, a responder allocation can be scored by summing each depot d's expected response time based on the queuing model (calculated using equation 1) and the overall time taken by responders to complete the rebalancing:

π_ϒ = Σ_{d ∈ D} Σ_{g ∈ G} 1(d, Λ) { responseTime(c_d, υ_dg, µ) + travelTime(d, g) }   (3)

where 1(d, Λ) is an indicator function that is set to 1 only if depot d has at least one responder, and the functions responseTime and travelTime denote the expected response time of a depot and the travel time needed by its agents to respond to incidents. The goal of an assignment method is then to find a responder allocation that minimizes this heuristic score. To minimize the total score we employ an iterative greedy approach, shown in Algorithm 1. Once the best depots are found, responders are assigned to them based on their current distance from the depots.

Algorithm 1: Iterative Greedy Action Selection
INPUT: number of agents |Λ|, depots D, depot capacities C, cell rates υ_g ∀ g ∈ G
final_depot_occupancy := Hash {d : 0} ∀ d ∈ D
do
    candidate_depots := Set ∅; candidate_scores := Hash ∅
    for d ∈ D do
        if final_depot_occupancy[d] < C(d) then
            temp_occ := final_depot_occupancy; temp_occ[d] += 1
            find ϒ_d by solving the system of linear equations {2a, 2b} given temp_occ
            π_{ϒ_d} := Σ_{d' ∈ D} Σ_{g ∈ G} 1(d', Λ) { responseTime(temp_occ[d'], υ_{d'g}, µ) + travelTime(d', g) }
            candidate_depots := candidate_depots ∪ {d}
            candidate_scores := candidate_scores ∪ {d : π_{ϒ_d}}
    best_depot := argmin_{d ∈ candidate_depots} π_{ϒ_d}
    final_depot_occupancy[best_depot] += 1
while sum(final_depot_occupancy) < |Λ|
return final_depot_occupancy

The approach dramatically decreases the computational complexity of rebalancing compared to a brute force search.
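The greedy loop of Algorithm 1 can be sketched as follows; `score_fn` stands in for the π_ϒ computation of equation (3), and all names are ours rather than the authors' implementation.

```python
import math

def greedy_allocation(num_agents, depots, capacity, score_fn):
    """Iterative greedy placement: repeatedly add one responder to whichever
    depot yields the lowest heuristic score, until all responders are placed.
    score_fn(occupancy_dict) -> π_ϒ for a candidate occupancy."""
    occupancy = {d: 0 for d in depots}
    for _ in range(num_agents):
        best_depot, best_score = None, math.inf
        for d in depots:
            if occupancy[d] >= capacity[d]:
                continue  # depot is full: capacity constraint
            trial = dict(occupancy)
            trial[d] += 1
            score = score_fn(trial)
            if score < best_score:
                best_depot, best_score = d, score
        occupancy[best_depot] += 1
    return occupancy
```

Each outer iteration re-scores at most |D| candidate occupancies, which is where the O(|G||D||Λ|) bound discussed next comes from.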
The complexity of solving the system of linear equations {2a, 2b} is O(|Λ|), as there are at most |Λ| depots that could have a resource allocated. The rates are split for each cell g ∈ G and each new depot under consideration d during each iteration of the greedy search in Algorithm 1, which is repeated |Λ| times to place each responder. This gives the overall algorithm a complexity of O(|G||D||Λ|). Taking the same example given above with |Λ| = 20 and |D| = 30, and assuming |G| = 900 (based on our geographic area of interest and patrol areas chosen by local emergency responders), the complexity is roughly 10^20 times less than a brute force search.

While this approach is not inherently decentralized, each agent can perform these computations and take actions itself, requiring minimal coordination. While straightforward and tractable, there are a few potential downsides to this approach. First, this policy does not take into account the internal state of the system. For example, a responder might be on its way to respond to an incident, thereby rendering it unavailable for rebalancing. Secondly, it assumes that historical rates of incident arrival can be used to optimize responder placement for the future, thereby not considering how future states of the system affect a particular rebalancing configuration. To address these issues, we propose a decentralized Monte-Carlo Tree Search algorithm.

3.3 Decentralized Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) is a simulation-based search algorithm that has been widely used in game-playing scenarios. MCTS-based algorithms evaluate actions by sampling from a large number of possible scenarios. The evaluations are stored in a search tree, which is used to explore promising actions. Typically, the exploration policy is dictated by a principled approach like UCT [12]. A standard MCTS-based approach is not suitable for our problem due to the sheer size of the state space in consideration, coupled with the low latency that ERM systems can afford. Instead, we focus on a decentralized multi-agent MCTS (MMCTS) approach explored by Claes et al. [3] for multi-robot task allocation during warehouse commissioning. In MMCTS, individual agents build separate trees focused on their own actions, rather than having one monolithic, centralized tree.
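The search-space arithmetic from the running example can be checked directly (the variable names are ours):

```python
import math

# Running example: |D| = 30 depots, |Lambda| = 20 responders, |G| = 900 cells.
joint_assignments = math.perm(30, 20)   # centralized: |D|! / (|D| - |Lambda|)!
per_agent_choices = 30                  # decentralized MMCTS: |D| per agent
greedy_cost = 900 * 30 * 20             # greedy heuristic: O(|G||D||Lambda|)

print(f"{joint_assignments:.2e}")       # ≈ 7.31e+25 joint depot assignments
print(joint_assignments // greedy_cost) # greedy search is ~1e20 times smaller
```

`math.perm` requires Python 3.8+; on older versions, the same value is `math.factorial(30) // math.factorial(10)`.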
This dramatically reduces the search space: in our case, at each evaluation step of a Monte-Carlo based approach, using a decentralized multi-agent search reduces the total number of choices from the number of permutations P(|D|, |Λ|) = |D|! / (|D| − |Λ|)! to only the number of depots |D|.

To realize MMCTS for an ERM domain, some extensions need to be made to standard UCT [7]. While an agent is building its own tree, it must model other agents' behavior. Since this estimation is required at every step of every simulation by each agent, finding a model that strikes a balance between computation time and accuracy of predicted actions is vital.

There are also global constraints on the system which mandate that agents maintain a minimal degree of coordination. For example, the number of resources assigned to a depot cannot be higher than its capacity. We take this into account by adding a filtering step to the decision process. Similar to Map-Reduce [5], each agent sends its evaluated actions to a central planner which makes the final decisions while satisfying global system constraints.

Next, we describe the architecture of our decentralized MMCTS-based algorithm.
• Reward Structure:
At the core of an MCTS approach is an evaluation function that measures the reward of taking an action in a given state. For a state s in the tree of agent λ_j, we design the reward ρ of taking an action a in s as

ρ(s, a) = ρ_{s−} − α^{t_h} t_r(s, a),                          if responding to an incident
ρ(s, a) = ρ_{s−} − α^{t_h} ψ (Σ_{λ_k ∈ Λ} ϕ_k(s, a)) / |Λ|,   if balancing at s   (4a)

where ρ_{s−} refers to the total accumulated reward at the parent of state s in the tree, α is the discount factor for future rewards, and t_h is the time since the beginning of the planning horizon. The evaluation function is split into cases reflecting the separate incident dispatch and balancing steps in our solution approach.

Algorithm 2: Decision Process
INPUT: state s, time limit t_lim
I := SampleIncidents(s)
E := I + rebalancing events
ranked action set Ã := ∅
for agent λ_j ∈ Λ do
    Ã[λ_j] := MMCTS(s, λ_j, E, t_lim)
recommended actions σ := CentralizedActionFilter(s, Ã)
apply σ to s
return s
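A minimal sketch of the reward function in equation (4a), with names of our own choosing; exactly one of the two cases applies at each tree node.

```python
def reward(parent_reward, alpha, t_h, *, response_time=None,
           balance_distances=None, psi=1.0):
    """Equation (4a): a discounted penalty added to the reward accumulated
    at the parent node. Pass response_time (t_r(s, a)) for a dispatch step,
    or balance_distances (the φ_k(s, a) values) for a balancing step;
    psi trades off response time against distance traveled."""
    discount = alpha ** t_h
    if response_time is not None:
        return parent_reward - discount * response_time
    avg_distance = sum(balance_distances) / len(balance_distances)
    return parent_reward - discount * psi * avg_distance
```

Note that both cases are penalties (negative increments), so maximizing reward minimizes discounted response time and rebalancing travel.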
In a dispatch step, the reward is updated with the discounted response time to the incident t_r(s, a). In a balancing step, we update the reward by the average distance traveled by the agents (we denote the distance traveled by agent λ_k while balancing due to action a in s by ϕ_k(s, a)). ψ is an exogenous parameter that balances the trade-off between response time and distance traveled for balancing, and is set by the user depending on their priorities. Distance is not included during dispatch actions, as we always send the closest agent.
• Evaluating other agents' actions:
Agents must have an accurate yet computationally cheap model of other agents' behavior; we explore two such possible policies: (1) a naive policy assuming that other agents will not rebalance, remaining at their current depot (referred to as the Static Agent Policy), and (2) an informed policy, in the form of the Queue Rebalancing Policy described in section 3.2. These are used to select actions for the other agents Λ \ {λ_i} when building agent λ_i's search tree, and are represented by the ActionSelection(available agents, state) function in line 5 of Algorithm 4.
• Rollout:
When working outside the MMCTS tree, i.e. rollingout a state, a fast heuristic is used to estimate the score of agiven action. We use greedy dispatch without balancing as ourheuristic. • Action Filtering:
The dispatching domain has several globalconstraints to adhere to, including ensuring that an incident isserviced if agents are available and that depots are not filledover capacity. To meet these constraints, we propose a filteringstep be added to the MMCTS workflow, similar to Map-Reduce.Once each individual agent has scored and ranked each possibleaction, these are sent to a centralized filter that chooses the finalactions for each agent to maximize utility without breaking anyconstraints.Another way global constraints affect the workflow is that theset of valid actions for an agent when they build their search treemay not be the same as the valid actions when it comes time forthem to make a decision. For example, consider two agents λ and λ ; if agent λ moves to a station and fills it to capacity, then agent λ cannot move to that station. To address this, we have agentsevaluate every action they could possibly take when expandingnodes in the tree, even if those actions would cause an invalid state.As the filter assigns actions to other agents, some of these actionscan become valid. To realize an online ERM decision support system requires a frame-work of interconnected processes. Our integration framework is
Algorithm 3:
MMCTS INPUT : state s, agent λ j , sampled events E , time limit t lim ; create root of search tree at s; do select most promising node n from tree using UCB1; childNode := Expand( n , λ j , next event e ∈ E after state( n )); r c := Rollout(childNode); back-propagate(child, r c ) while within time limit t lim ; return actions λ j could take ranked by average reward Algorithm 4:
Expand INPUT : Search Tree Node n , agent λ j , next important event e ; if e is balancing step then select un-explored action a ∈ A j ; λ j takes action a ; actions available to other agents are updatedActionSelection( Λ \{ unavailable agents } , state( n )); else if e is an incident then dispatch nearest agent to incident create new child node n c from selected actions; update the child’s reward based on the response times (if any) and agentbalancing movement update n c to the time of the next event e , fast forwarding the state; return n c ; Algorithm 5:
Centralized Action Filter INPUT : state s , ranked actions (cid:101) A ; Λ avail := agents( s ) do candidate_actions := ∅ ; for Agent λ j ∈ Λ avail do candidate_actions[ λ j ] := valid action a j ∈ (cid:101) A[ λ j ] with highestreward ρ ( s , a ) ; find agent λ j with highest scored action a j ∈ candidate_actions; λ j takes action a j ; update actions available to other agents accordingly; remove λ j from Λ avail ; while there are unassigned agents ; built on our prior modular ERM pipeline work[13]. It includes thefollowing components: • A traffic routing model to support routing requests. • A model of the environment and how it changes over time,which is used by the incident prediction model. • A model of the spatio-temporal distribution of incidents. • A decision process that makes dispatching recommendationsbased on the current state of the environment, responder loca-tions, and future incident distributions.This framework is a natural choice as it decouples the decisionprocess (our focus in this work) from other components. As it wasdesigned for the centralized, post-incident dispatching approach,we make necessary changes to adapt it to our needs. The underlyingdiscrete event simulation was generalized to accept events otherthan incident occurrence, such as periodic balancing events. Thedecision process was also extended to handle distributed, multi-agent approaches. An overview of the extended framework can beseen in figure 1.In our experiments we use a Euclidean distance based router,and the incident prediction model outlined in section 2. Due to igure 1: Extended Decentralized ERM Framework Overview the framework’s modularity, these components can be replacedwithout affecting the decision process.
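To make the two cases of the reward in Eq. (4a) concrete, the update can be sketched in Python. The function name, signature, and default ψ here are our own illustrative choices; the simulator state is abstracted down to the few quantities the equation actually uses.

```python
def reward(parent_reward, alpha, t_h, *, response_time=None,
           balance_distances=None, psi=10.0):
    """Accumulated reward for a tree node, following the two cases of Eq. (4a).

    parent_reward     -- rho_{s-}: reward accumulated at the parent node
    alpha             -- discount factor for future rewards
    t_h               -- time since the start of the planning horizon
    response_time     -- t_r(s, a), given when the event is an incident dispatch
    balance_distances -- [phi_k(s, a) for each agent], given for a balancing step
    psi               -- trade-off weight between response time and distance
    """
    discount = alpha ** t_h
    if response_time is not None:               # dispatch step
        return parent_reward - discount * response_time
    # balancing step: penalize the mean distance traveled, weighted by psi
    mean_dist = sum(balance_distances) / len(balance_distances)
    return parent_reward - discount * psi * mean_dist
```

Both cases subtract from the parent's accumulated reward, so deeper nodes carry the full discounted history of their trajectory.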
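The greedy selection loop of Algorithm 5 can be sketched as follows. This is a simplified, hypothetical rendering: agents and actions are plain identifiers, and all global constraints (e.g., depot capacities) are abstracted behind an `is_valid` callback.

```python
def centralized_action_filter(ranked_actions, is_valid):
    """Greedy filter in the spirit of Algorithm 5: repeatedly commit the single
    highest-scoring valid action across all unassigned agents.

    ranked_actions -- {agent: [(action, reward), ...]} sorted by reward, descending
    is_valid       -- is_valid(agent, action, assigned) -> bool, encoding the
                      global constraints against actions already committed
    """
    assigned = {}
    unassigned = set(ranked_actions)
    while unassigned:
        # Each agent's best currently-valid action.
        candidates = {}
        for agent in unassigned:
            for action, score in ranked_actions[agent]:
                if is_valid(agent, action, assigned):
                    candidates[agent] = (action, score)
                    break
        if not candidates:
            break  # no remaining agent has a valid action
        # Commit the globally best candidate, then re-check the rest.
        agent = max(candidates, key=lambda a: candidates[a][1])
        assigned[agent] = candidates[agent][0]
        unassigned.remove(agent)
    return assigned
```

With depot capacities encoded in `is_valid`, the best-scoring agent is assigned its preferred depot first, and the remaining agents are re-evaluated against the updated assignment, which is how an action invalid during tree building can become valid at decision time.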
Incident Prediction Model:
While the broader approach of rebalancing the spatial distribution of responders is flexible enough to work with any modular incident forecasting model, we provide a brief evaluation of forecasting using survival analysis. To this end, we generate forecasts 4 hours into the future at half-hour intervals over the entire test set, and then repeat the procedure 5 times to reduce variance and increase our confidence in the forecasts. Finally, we create a heatmap (the average of all forecasted rates in the test set) to visualize the performance of the model in comparison to actual incidents (see Figure 2). The forecasting model captures the high and low density areas fairly accurately, as well as the spatial spread of the incidents.
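The averaging step that produces the comparison heatmap can be sketched with NumPy. The array shapes below are hypothetical stand-ins, and random rates replace the survival model's output purely to show the reduction over repetitions and forecast windows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for illustration: 5 repetitions of the procedure,
# 48 half-hour forecast windows, on a 30 x 30 cell grid.
n_reps, n_windows, grid = 5, 48, (30, 30)

# In the real pipeline each slice would come from the survival model;
# random rates stand in here.
forecasts = rng.random((n_reps, n_windows, *grid))

# Average over repetitions (variance reduction) and over all forecast
# windows in the test set to produce a single per-cell rate heatmap.
heatmap = forecasts.mean(axis=(0, 1))   # shape (30, 30)
```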
We perform our evaluation on data from Nashville, TN, a major metropolitan area of the USA with a population of approximately 700,000. The depot locations are based on actual ambulance stations obtained from the city. Traffic accident data was obtained from the Tennessee Department of Transportation and includes the location and time of each incident. The incident prediction model was trained on 35,858 incidents occurring between 1-1-2018 and 1-1-2019, and we evaluated the decision processes on 2,728 incidents occurring in the month of January 2019.
Experimental Configuration and Assumptions:
We limit the capacity of each depot to 1 in our experiments. This is motivated by two factors: first, it encourages responders to be geographically spread out so they can respond quickly to incidents occurring in any region of the city; second, it models the use of ad-hoc stations by responders, which are often temporary parking spots. While responder service times at incidents are assumed to be exponentially distributed in the real world, we set them to a constant for these experiments. This ensures that the experiments across different methods and parameters are directly comparable. If deployed, however, proper service time distributions should be learned and sampled from for each ERM system. We set the total number of responders to 26, which is the actual number of responders in Nashville. We split the geographic area into 900 1 × 1 mile square cells; this choice follows the granularity of discretization used by local authorities. To smooth out model noise, each agent evaluates 5 sampled incident chains from the generative model and averages the scores for each action across the playouts. The standard UCB1 [1] algorithm is used to select the most promising node during MCTS iterations. Finally, we augment the queue based rebalancing policy by adding a radius of influence (RoI) for each cell: only depots within a cell's RoI are considered when splitting its rate, to encourage an even agent distribution and reduce computation time.

Figure 2: Heatmaps comparing average incident rates for the forecasting model (left) with actual incidents in Nashville, TN (right).

We now discuss the results of the experiments for the two policies.
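The RoI-based rate splitting can be sketched as below. The even split among in-range depots is our simplifying assumption for illustration; the actual queue-based policy of Section 3.2 may weight depots differently.

```python
from math import dist

def split_cell_rates(cell_rates, cell_centers, depots, roi):
    """Split each cell's forecast incident rate among depots within its RoI.

    cell_rates   -- {cell_id: forecast incident rate}
    cell_centers -- {cell_id: (x, y)} cell centroids, in miles
    depots       -- {depot_id: (x, y)} depot locations, in miles
    roi          -- radius of influence, in miles

    Returns {depot_id: accumulated rate share}.
    """
    shares = {d: 0.0 for d in depots}
    for cell, rate in cell_rates.items():
        in_range = [d for d, loc in depots.items()
                    if dist(loc, cell_centers[cell]) <= roi]
        if not in_range:          # no depot covers this cell
            continue
        for d in in_range:        # assumed even split among in-range depots
            shares[d] += rate / len(in_range)
    return shares
```

For instance, with depots at (0, 0) and (3, 0) and RoI = 2, a cell centered at (1.5, 0) splits its rate evenly between both depots, while a cell at (0.5, 0) contributes only to the first.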
We first compare the queue based rebalancing policy described in Section 3.2 to the baseline policy of no rebalancing. In these experiments rebalancing occurred every half hour, and the incident rates υ were average historical rates from the training data. We tested several values (in miles) for the depots' RoI, and compared the distributions of response times (Figure 3a) and the rebalancing distance traveled by each responder (Figure 3b).

Our first observation is that increasing the RoI does not necessarily increase performance; there is an optimal zone around RoI = 3, implying that encouraging responders to spread out is beneficial. We also see that while Q-3's median and first quartile response times remained fairly consistent with the baseline, the upper quartiles are reduced. This decreases the response time's mean and variance, making the system more fair to all incidents. We also observe that Q-2's and Q-3's responders traveled less than 1 mile on average each balancing step.

To determine the potential of the MMCTS rebalancing approach, we first compare the two agent action models described in Section 3.3 (Static Agent Policy and Queue Rebalancing Policy) using an oracle, which has complete information regarding future incidents (this assumption takes the errors of the prediction model out of the comparison and enables us to observe the best results that we can obtain). We present the results for the response time distributions in Figure 3c and the average responder distance traveled per rebalancing step in Figure 3d.

Our first observation is that the MMCTS approach has high potential. Using an oracle, it is able to significantly decrease the response time distribution compared to the queue based policy above. This is not surprising, given that a standard MCTS algorithm with perfect information should perform well given adequate time, but it demonstrates that the MMCTS extensions of independent action evaluation for each agent and action filtering are valid. Secondly, we see that MR-1 (using a static agent policy) outperforms MR-2 (using the queue rebalancing policy). Last, we observe that responders traveled between 2 and 4 miles on average during each balancing step in these experiments, which is significantly higher than with the queuing approach.

Figure 3: a) The response time distributions for each queue rebalancing policy experiment. b) Distribution of average miles traveled by each responder at each balancing step in the queue rebalancing policy experiments. The baseline approach has no rebalancing, so it is excluded. c) The response time distributions for each MMCTS experiment using an oracle. d) Distributions of average miles traveled by each responder at each balancing step of the MMCTS experiments using an oracle. e) The response time distributions for each MMCTS parameter search experiment. f) Distributions of average miles traveled by each responder at each balancing step of the MMCTS parameter search experiment.

Next, we examine a more practical approach using the incident prediction model based on survival analysis. Since the static agent policy performed better in the oracle experiments, we use it for these experiments. There are several hyper-parameters that can affect the performance of the algorithm, including (1) the MCTS iteration limit, (2) the rebalancing period (the amount of time between rebalancing steps), (3) the distance weight ψ in the reward function (representing the importance of distance traveled for rewards), and (4) the look-ahead horizon for MCTS.

We vary these parameters to see their effect on the system (see Table 2). We present the response time distributions of MMCTS using the incident model in Figure 3e, and the average responder distance traveled per rebalancing step in Figure 3f. We observe that different parameter choices lead to different performance characteristics. For example, we see that changing the distance weight has a large impact on the distance responders travel; users with tight budgets for responder movement and maintenance will want to pay close attention to this parameter. Comparing the queue based policy with MMCTS, we see that both improve the response time distributions compared to the baseline. MMCTS is more configurable, but is also more sensitive to poor hyper-parameter choices. With proper hyper-parameter choices, both fulfill the constraints discussed in Section 2.3 by having quick dispatching decisions, allowing for limited communication, and allowing users to control for distance traveled (i.e., wear and tear).

Principled approaches to Emergency Response Management (ERM) decision making have been explored, but have failed to be implemented in real systems. We have identified that a key issue with these approaches is that they focus on post-incident decision making. We argue that due to fairness constraints, planning should occur between incidents. We define a decision theoretic model for such planning, and implement both a heuristic search using queuing theory and a Multi-Agent Monte Carlo Tree Search planner. We find
that these approaches maintain system fairness while decreasing the average response time to incidents.

While the focus of this work is on the ERM domain, there are important takeaways for general agent-based systems: (1) Planning performance depends on the quality of the underlying event prediction models. (2) It is imperative to understand the needs and constraints of a target domain when designing a planning approach for it to be accepted in practice. (3) The computational capacity of "agents" has evolved in recent decades, and should be used to create decentralized planning approaches. Given these takeaways, we will explore the applicability of this framework to other domains where planning occurs over a spatio-temporally evolving process.

Table 2: Outline of the experimental runs performed and their corresponding hyper-parameter choices. (∗When not indicated, parameters are set to the values of M-1, the MMCTS Baseline in the table.)

Identifier | Description | Hyper-Parameter Choices
BASE | Greedy baseline without rebalancing | N/A
Q-1 | Queue based rebalancing policy with RoI of 1 | RoI = 1
Q-2 | Queue based rebalancing policy with RoI of 2 | RoI = 2
Q-3 | Queue based rebalancing policy with RoI of 3 | RoI = 3
Q-4 | Queue based rebalancing policy with RoI of 4 | RoI = 4
Q-5 | Queue based rebalancing policy with RoI of 5 | RoI = 5
MR-1 | MMCTS using an oracle for future incidents and a Static Agent Policy | Same as MMCTS Baseline M-1
MR-2 | MMCTS using an oracle for future incidents and a Queue Rebalancing Policy | Same as MMCTS Baseline M-1
M-1 | MMCTS Baseline: the foundation for the parameter search. Each parameter varies independently while the other parameters retain these values. (All M-* experiments use generated incident chains and a Static Agent Policy.) | MCTS Iteration Limit = 250; Lookahead Horizon = 120 min; Reward Distance Weight ψ = 10; Reward Discount Factor = 0.99995; Rebalance Period = 60 min
M-2 | MMCTS with iteration limit of 100 | MCTS Iteration Limit = 100*
M-3 | MMCTS with iteration limit of 500 | MCTS Iteration Limit = 500*
M-4 | MMCTS with reward distance weight ψ of 0 | Reward Distance Weight ψ = 0*
M-5 | MMCTS with reward distance weight ψ of 100 | Reward Distance Weight ψ = 100*
M-6 | MMCTS with rebalance period of 30 minutes and lookahead horizon of 30 minutes | Lookahead Horizon = 30 min; Rebalance Period = 30 min*

Acknowledgement:
This work is sponsored by the National Science Foundation (CNS-1640624, IIS-1814958, IIS-1905558), the Tennessee Department of Transportation, and the Center for Automotive Research at Stanford (CARS). We thank our partners from the Metro Nashville Fire Department and Metro Nashville Information Technology Services. We would also like to thank Hendrik Baier (CWI) for insights and helpful discussions regarding the paper, and Sayyed Mohsen Vazirizade (Vanderbilt) for his help in reviewing the manuscript and providing important feedback.

REFERENCES
[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
[2] Craig Boutilier. 1996. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge. Morgan Kaufmann Publishers Inc., 195–210.
[3] Daniel Claes, Frans Oliehoek, Hendrik Baier, and Karl Tuyls. 2017. Decentralised online planning for multi-robot warehouse commissioning. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 492–500.
[4] One Concern. 2017. Artificial Intelligence: A Game Changer for Emergency Response. Technical Report. One Concern.
[5] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113. https://doi.org/10.1145/1327452.1327492
[6] Nashville Fire Department. 2018. Private Communication. (2018).
[7] Johannes Fürnkranz and Tobias Scheffer. 2006. Machine Learning: ECML 2006: 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings. Vol. 4212. Springer Science & Business Media.
[8] Natarajan Gautam. 2012. Analysis of Queues: Methods and Applications. CRC Press.
[9] Mohammad Ghavamzadeh and Sridhar Mahadevan. 2006. Learning to Cooperate using Hierarchical Reinforcement Learning. (2006).
[10] Shenyang Guo. 2010. Survival Analysis. Oxford University Press.
[11] Sean K. Keneally, Matthew J. Robbins, and Brian J. Lunday. 2016. A Markov decision process model for the optimal dispatch of military medical evacuation assets. Health Care Management Science 19, 2 (2016), 111–129.
[12] Levente Kocsis and Csaba Szepesvári. 2006. Bandit based Monte-Carlo planning. In European Conference on Machine Learning. Springer, 282–293.
[13] Ayan Mukhopadhyay, Geoffrey Pettet, Chinmaya Samal, Abhishek Dubey, and Yevgeniy Vorobeychik. 2019. An online decision-theoretic pipeline for responder dispatch. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems. ACM, 185–196.
[14] Ayan* Mukhopadhyay, Geoffrey* Pettet, Chinmaya Samal, Abhishek Dubey, and Yevgeniy Vorobeychik. 2019. An Online Decision-Theoretic Pipeline for Responder Dispatch. In ACM/IEEE International Conference on Cyber-Physical Systems. ACM, 12 pages.
[15] Ayan Mukhopadhyay, Yevgeniy Vorobeychik, Abhishek Dubey, and Gautam Biswas. 2017. Prioritized Allocation of Emergency Responders based on a Continuous-Time Incident Prediction Model. In International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 168–177.
[16] Ayan Mukhopadhyay, Zilin Wang, and Yevgeniy Vorobeychik. 2018. A Decision Theoretic Framework for Emergency Responder Dispatch. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018. 588–596. http://dl.acm.org/citation.cfm?id=3237471
[17] Ayan Mukhopadhyay, Chao Zhang, Yevgeniy Vorobeychik, Milind Tambe, Kenneth Pence, and Paul Speer. 2016. Optimal Allocation of Police Patrol Resources Using a Continuous-Time Crime Model. In Conference on Decision and Game Theory for Security.
[18] Geoffrey Pettet, Saideep Nannapaneni, Benjamin Stadnick, Abhishek Dubey, and Gautam Biswas. 2017. Incident analysis and prediction using clustering and Bayesian network. IEEE, San Francisco, CA, USA, 1–8.
[19] H. Purohit, S. Nannapaneni, A. Dubey, P. Karuna, and G. Biswas. 2018. Structured Summarization of Social Web for Smart Emergency Services by Uncertain Concept Graph. 30–35. https://doi.org/10.1109/SCOPE-GCTC.2018.00012
[20] Khashayar Rohanimanesh and Sridhar Mahadevan. 2003. Learning to take concurrent actions. In Advances in Neural Information Processing Systems. 1651–1658.
[21] Wikipedia contributors. 2019. Computer-aided dispatch — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Computer-aided_dispatch&oldid=916096608 [Online; accessed 20-October-2019].
[22] Yisong Yue, Lavanya Marla, and Ramayya Krishnan. 2012. An efficient simulation-based approach to ambulance fleet allocation and dynamic redeployment. In Twenty-Sixth AAAI Conference on Artificial Intelligence.