A Coordinated MDP Approach to Multi-Agent Planning for Resource Allocation, with Applications to Healthcare
Hadi Hosseini
David R. Cheriton School of Computer Science, University of Waterloo
[email protected]

Jesse Hoey
David R. Cheriton School of Computer Science, University of Waterloo
[email protected]

Robin Cohen
David R. Cheriton School of Computer Science, University of Waterloo
[email protected]
ABSTRACT
This paper considers a novel approach to scalable multiagent resource allocation in dynamic settings. We propose an approximate solution in which each resource consumer is represented by an independent MDP-based agent that models expected utility using an average model of its expected access to resources, given only limited information about all other agents. A global auction-based mechanism is proposed for allocations based on expected regret. We assume truthful bidding and a cooperative coordination mechanism, as we are considering healthcare scenarios. We illustrate the performance of our coordinated MDP approach against a Monte Carlo based planning algorithm intended for large-scale applications, as well as other approaches suitable for allocating medical resources. The evaluations show that the global utility value across all consumer agents is closer to optimal when using our algorithms under certain time constraints, with low computational cost. As such, we offer a promising approach for addressing complex resource allocation problems that arise in healthcare settings.
Categories and Subject Descriptors
I.2.11 [Distributed Artificial Intelligence]: Multiagent Systems
General Terms
Algorithms, Experimentation
Keywords
Multiagent Planning, Multiagent MDP, Healthcare Applications
1. INTRODUCTION
This paper develops an approach for allocating resources in multiagent systems for domains where there are multiple agents and multiple tasks, and the success of the agents carrying out tasks depends stochastically on their ability to obtain a sequence of resources over time. We are particularly interested in situations where agents must independently optimize over their individual states, actions, and utilities, but must also solve a complex coordination problem with other agents in the usage of limited resources.
Appears in The Eighth Annual Workshop on Multiagent Sequential Decision-Making Under Uncertainty (MSDM-2013), held in conjunction with AAMAS, May 2013, St. Paul, Minnesota, USA.
In particular, we are concerned with allocating resources in settings that involve a set of N consumers, each of whom requires some subset of a total of M resources. The consumers each have a measure of health that they are trying to optimize, and this quality is influenced stochastically by the resources they acquire and by time. (We use the term health in a general sense, to denote a single quantity over which an agent's utility function, and hence its reward, is defined; it can be, e.g., the quality of a solution, the value of an outcome, or a patient's state of health.) Further, each consumer has a resource pathway that represents the partial ordering in which they need the resources. Consumers' states evolve independently over time, and are dependent only through their need for shared resources. Rewards are independent, and the global reward is the sum of individual consumer rewards.

We formulate this problem as a factored multiagent Markov Decision Process (MMDP) with explicit features for each consumer's state and resource utilization, and an explicit model of how each consumer's state progresses stochastically over time depending on the resources obtained. The actions are the possible allocations of resources in each time step. For realistic numbers of consumers and resources, however, such an MMDP has a state and action space that precludes computation of an optimal policy. This paper addresses this problem and makes three contributions:

1. We develop an approximate distributed approach, in which the full MMDP is broken into N MDPs, one for each consumer. We call these consumer MDPs agents. Agents model the resources they expect to obtain using a probability distribution derived from average statistics of the other agents, and compute expected regret based on this distribution and on the known dynamics of their health state.

2. We propose an iterative auction-based mechanism for real-time resource allocation based on the agents' individual expected regret values. The iterative nature of this process ensures a reasonable allocation at minimal computational cost.

3. We demonstrate the advantages of our approach in a cooperative healthcare domain with patients seeking doctors and equipment in order to improve their health states. We present averages of simulations using randomly generated agents drawn from a reasonable prior distribution. We compare our coordinated MDP approach against an alternate planning algorithm intended for large-scale applications: UCT, a state-of-the-art Monte Carlo sampling based method, applied to the full MMDP model. We also compare to simple but realistic heuristic approaches for allocating medical resources.

Our approach is particularly well suited to large collaborative domains that require rapid responses to resource allocation demands in time-critical settings, and we use a healthcare scenario throughout the paper to clarify our solution. We start by introducing the MMDP model and our distributed approach, followed by descriptions of the baseline methods we compare to. We then develop a set of realistic models for use in simulation, and show results across a range of problem sizes.
2. MDPS AND COORDINATION
Our model is a factored MDP represented as a tuple ⟨N, M, τ, R, H, P_T, Φ, A⟩, where N is the number of consumers, M is the number of resources, and τ is the planning horizon. R = {R_1, ..., R_N} is a finite set of resource variables, each one representing the state of a single consumer's resource utilizations, where R_i = {R_{i1}, R_{i2}, ..., R_{iM}} is a set of variables representing consumer i's utilization of each resource j. Each R_{ij} ∈ R, where R is the set of possible resource utilizations (how much of the resource is being used). We model each resource as distinct, so multiple copies of a resource are modeled separately. H = {H_1, ..., H_N} is a set of N variables measuring each consumer's health, each of which is H_i ∈ H, giving the different levels of health. We use s_i = {R_i, H_i} to denote the complete set of state variables for consumer i, and S = (s_1, ..., s_N) to denote the complete state for all consumers. Agent i receives a reward of Φ_i(s_i, s'_i) for the transition from s_i to s'_i; thus the multiagent system's reward function is Φ(S, S') = Σ_i Φ_i(s_i, s'_i). The transition model is defined as P_T(S' | S, A) = Π_i P_i(s'_i | s_i, a_i), which denotes the probability of reaching joint state S' from joint state S, where A is a set of permissible actions, one for each resource and each consumer, representing all feasible allocations of resources (the same resource cannot be allocated to two agents simultaneously). Resources are deterministic given the actions, and only one resource can be allocated to each consumer at a time. We assume a finite-horizon, undiscounted setting; this is realistic in healthcare scenarios, as health states do not warrant discounting.

The full MDP as described is an instance of a multiagent MDP (MMDP), and is very challenging to solve optimally for reasonable numbers of consumers and resources: the total number of states is |S| = |H|^N |R|^{MN}, and the number of joint actions is N!/(N − M)!. We will show how to compute approximate (sample-based) solutions later in this paper, but first we present our approach to distributing this large MDP into N smaller MDPs, and introduce our coordination mechanism for computing approximate allocations.
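To make the scale concrete, the following minimal Python sketch (our illustration; the instance sizes are hypothetical example values, not taken from the experiments) evaluates these two counting formulas:

```python
from math import factorial

def joint_sizes(n_consumers, n_resources, health_levels, utilization_levels):
    """Joint state/action counts for the flat MMDP (assumes N >= M)."""
    n_states = (health_levels ** n_consumers
                * utilization_levels ** (n_resources * n_consumers))
    # Feasible joint allocations: injective assignments of the M distinct
    # resources to the N consumers, i.e. N!/(N - M)!.
    n_actions = factorial(n_consumers) // factorial(n_consumers - n_resources)
    return n_states, n_actions

# Even a modest hypothetical instance is far beyond exact solvers:
states, actions = joint_sizes(n_consumers=10, n_resources=4,
                              health_levels=3, utilization_levels=3)
print(f"|S| = {states:.2e}, joint actions per step = {actions}")
# -> |S| = 7.18e+23, joint actions per step = 5040
```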
Figure 1: A patient's MDP with 3 resources, shown as a two-time-slice influence diagram.

We treat each consumer's MDP as independent (an agent), an example of which is shown in Figure 1. We assume that the agents' state spaces, resource utilizations, health states, and transition and reward functions are independent. The agents are only dependent through their shared usage of resources: only feasible allocations are permitted, as described above (agents cannot simultaneously share resources). Rewards are additive, and each agent's actions now become requests for resources, as described below. We make two further assumptions. First, the reward function for each agent depends on the agent's health, H, and is set to zero by a Boolean factor at the end of resource acquisition (finishing the medical pathway by receiving all required resources). Second, the agent's health H is conditionally independent of the agent's action given the current resources and the previous health, and the agent's actions only influence the resource allocation, since the agent can only influence its health indirectly by bidding for resources.

Thus, for each agent i, P_i(r', h' | r, h, a) factors as

P_i(r', h' | r, h, a) = P_i(r' | r, h, a) P_i(h' | r, h)    (1)

where Λ_R ≡ P_i(r' | r, h, a) is the probability of obtaining the next set of resources given the current health, resources, and action, and Ω_H ≡ P_i(h' | r, h) is a dynamic model of the agent's health. We refer to Λ_R as the resource obtention model and to Ω_H as the health progression model.

Health progression is a property of a particular agent's condition or task, and can be estimated from global statistics about the nature of the conditions (e.g., diseases). Ω_H must be elicited from prior knowledge about diseases and treatments, and so forms part of a disease model that we henceforth assume is pre-defined (manually, or by learning from historical statistics). The resource obtention model Λ_R, on the other hand, depends on the current state of the multiagent system, and is a property of how we set up our resource allocation mechanism and the expected regret computations of each agent. For example, the probability of a single agent obtaining a resource will depend on (i) the number of other agents currently bidding for that resource, and (ii) the agent's model of health.

If a single MDP were used for all agents, as described at the start of this section, resources would be deterministic given a joint allocation action. If modeled as a decentralized POMDP, the resources for each consumer would be conditioned on the unobservable states and actions of all the other consumers. In our model, we assume that the probability of obtaining a certain resource can be approximated reasonably well, either as a prior model based on the known distribution of diseases and the known requirements for treating each disease, or as a learned distribution based on simulated or real experiments.

In general, we can make no assumptions about further conditional independencies in the resource allocation factor. That is, the probability of obtaining a resource R' at time t may depend stochastically on the full set of resources at time t − 1. In many domains, however, there may be further independencies that can be encoded in the model. For example, in Figure 1, resource R'_i is conditionally independent of all resources R_j with j ∉ {i, i − 1} (for i > 1) and j ∉ {i} (for i = 1), so the resources are ordered according to the (linear) medical pathway of this particular patient. We assume that the health progression factor can be specified for each agent independently of the other agents.

A policy for each individual MDP is a function π_i(s_i) ↦ A_i that gives an action for the agent to take in each state s_i. The policy can be obtained by computing a value function V*_i(s_i) for each state s_i ∈ S_i that satisfies the Bellman equation [2]. For simplicity of notation, we drop agent indices and only show the indices for resources. An individual agent's value function is thus

V*(s) = max_a Σ_{s' ∈ S} P(s' | s, a) [Φ(s, s') + γ V*(s')]    (2)

The policy is then given by the actions at each state that are the arguments of the maximization in Equation 2.
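To illustrate how each agent's policy could be computed from Equation 2, here is a minimal finite-horizon value iteration sketch in Python; the tabular states, actions, P, and Phi are hypothetical stand-ins for a single agent's (R_i, H_i) state space, its factored transition model, and its reward:

```python
def value_iteration(states, actions, P, Phi, tau):
    """Finite-horizon, undiscounted value iteration (Equation 2, gamma = 1).

    P[(s, a)] is a dict {s_next: probability}; Phi[(s, s_next)] is the reward.
    Returns the value function and a greedy policy.
    """
    V = {s: 0.0 for s in states}          # value with 0 steps to go
    policy = {}
    for _ in range(tau):                  # back up tau steps
        V_new = {}
        for s in states:
            q = {a: sum(p * (Phi[(s, sp)] + V[sp])
                        for sp, p in P[(s, a)].items())
                 for a in actions}
            best = max(q, key=q.get)
            V_new[s], policy[s] = q[best], best
        V = V_new
    return V, policy
```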
Agents compute their expected regret for not obtaining a given resource as follows. The expected value Q_i(h, r, a_i) of being in health state h with resources r at time t, bidding for (denoted a_i) and receiving resource r_i at time t + 1, is

Q_i ≡ Σ_{r'_{−i}} Σ_{h'} P(h' | h, r) V(r'_i, r'_{−i}, h') δ(r_{−i}, r'_{−i})

where r_{−i} is the set of all resources except r_i, and δ(x, y) = 1 if x = y and 0 otherwise. The equivalent value for not receiving the resource, Q̄_i(h, r, a_i), is

Q̄_i ≡ Σ_{r'_{−i}} Σ_{h'} P(h' | h, r) V(r̄'_i, r'_{−i}, h') δ(r_{−i}, r'_{−i})

Thus, the expected regret for not receiving resource r_i when in h with resources r and taking action a_i is

R_i(h, r, a_i) = Q_i − Q̄_i    (3)

We also refer to this as the expected benefit of receiving r_i. It is important for agents in this setting to consider regret (or benefit) instead of value, as two agents may value a resource equally, but one might depend on it much more (e.g., have no other option). Value-based bids would fail to communicate this important information to the allocation mechanism.

Note that Q is an optimistic estimate, since the expected value assumes the optimal policy can be followed after a single time step (which is untrue). This myopic approximation enables us to compute on-line allocations of resources in the complete multiagent problem, as described in the next section. In the following, we will use the notion of utilitarian social welfare, aggregating the total rewards amongst all agents, as an evaluation measure.

A coordination mechanism must aim to respect the health needs of the patients in order to maximize the overall utility. Each agent estimates its expected individual regret given its estimate of future resources and health (as given by Λ_R and Ω_H). The regret values of different agents are compared globally, and an allocation is sought that minimizes the global regret. While the final allocation decisions are made greedily in the action-selection phase, the reported expected values of regret (for bidding) consider future rewards.

To implement this allocation, we use an iterative, auction-like procedure in which each consumer bids on the resource with highest regret. The highest bidder gets the resource, and all other agents bid on their next-highest-regret resource. Agents can also resign, receiving no resources for one time step, and try again in a future time step. A sketch of one step of this procedure is given below.
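The following Python sketch is our rendering of one allocation step; expected_regret stands for the R_i of Equation 3, and the rule that an agent resigns when no remaining resource has positive expected regret is one plausible choice, not something the mechanism itself dictates:

```python
def auction_step(agents, resources, expected_regret):
    """One allocation step of the iterative auction.

    expected_regret(agent, resource) -> the agent's bid (R_i of Equation 3).
    Returns {agent: resource won, or None if the agent resigned/lost out}.
    """
    allocation = {}
    available = set(resources)
    active = set(agents)
    while active and available:
        bids = {}
        for agent in list(active):
            best = max(available, key=lambda r: expected_regret(agent, r))
            regret = expected_regret(agent, best)
            if regret <= 0:                # one possible resignation rule:
                allocation[agent] = None   # wait for a future time step
                active.discard(agent)
            else:
                bids[agent] = (best, regret)
        if not bids:
            break
        # The highest bidder wins its resource; losers re-bid next round.
        winner = max(bids, key=lambda a: bids[a][1])
        won, _ = bids[winner]
        allocation[winner] = won
        available.discard(won)
        active.discard(winner)
    for agent in active:                   # resources ran out this time step
        allocation[agent] = None
    return allocation
```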
Consider a simplified scenario with 4 agents and 4 resources. We assume that each agent requires all four resources, and that the expected benefits of receiving resources (equivalently, the regrets for not receiving them), based on the agents' internal utility functions, have been calculated as illustrated in Table 1. The worst case occurs when all the agents have attributed higher benefits to the same resources, so that their desired acquisition order over resources is identical.

Table 1: Example scenarios with 4 agents and 4 resources. *X marks the optimal allocation, while bold X marks the allocation chosen by our method. (a) Worst case; (b) average case.

Agents first try to acquire the resource with the highest benefit. In the worst-case scenario, all agents associate the highest benefit with the same resource, but only one of them succeeds in acquiring it. All agents who lost the previous auction then bid for the resource with their second-highest benefit, and so on. Our auction-based method gives a total benefit of 22 (shown in bold in Table 1a), whereas the optimal allocation has a benefit of 25 (shown with * in Table 1a).

Table 1b shows an average-case scenario. Again, all agents require all the resources, but with more diverse preferences over the set of resources; here our method's total benefit is much closer to the optimal benefit.
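A toy run of the auction_step sketch above reproduces this worst-case behaviour; the benefit values are invented for illustration and are not the entries of Table 1:

```python
# Both agents want r1 most, so the greedy auction can be suboptimal.
benefit = {("a1", "r1"): 9, ("a1", "r2"): 8,
           ("a2", "r1"): 7, ("a2", "r2"): 1}
alloc = auction_step(agents=["a1", "a2"], resources=["r1", "r2"],
                     expected_regret=lambda agent, r: benefit[(agent, r)])
print(alloc)   # {'a1': 'r1', 'a2': 'r2'}: total benefit 10; optimal is 15
```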
3. BASELINE SOLUTION METHODS

3.1 Sample-Based
We will compare our algorithm to the result of a sample-based solution of the full MMDP described in Section 2. UCT is a rollout-based Monte Carlo planning algorithm [11] in which the MDP is simulated to a certain horizon many times, and the average rewards gathered are used to select the best action to take next. To balance exploration and exploitation, UCT chooses an action by modeling an independent multi-armed bandit problem, considering the number of times the current node and its chosen child node have been visited, according to the UCB1 policy [1]. In general, UCT can be considered an any-time algorithm, and it converges to the optimal solution given sufficient time and memory [11]. UCT has become the gold standard for Monte Carlo based planning in Markov decision processes [10].

For rollouts at each state, we use uniform random action selection from the set of permissible actions, i.e., those that do not cause any conflict over resource acquisition. The best action is then chosen based on the UCB1 policy. The amount of time UCT uses for rollouts is its timeout, a parameter that we must set carefully in our experiments, as it directly impacts the value of the sample-based solution. Although in some resource allocation settings lengthy decision periods would not affect the efficiency of allocations, the time taken to make allocation decisions is arguably important in domains requiring urgent decisions, such as emergency departments and environments exposed to significant change: delayed decisions for critical patients with acute conditions in emergency departments can have a huge impact on the effectiveness of treatments [6]. Moreover, the allocation solution may become useless by the time an optimal decision is computed, as a result of fluctuations in demand, and hence require recomputing the allocation decision. We will compare to UCT using a number of different realistic timeout settings. A sketch of the UCB1 selection rule appears below.
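This is a generic textbook form of the UCB1 rule from [1], not our exact implementation; q holds each action's average return, n_node the node's visit count, and n_a each action's visit count:

```python
import math

def ucb1_select(actions, q, n_node, n_a, c=math.sqrt(2)):
    """Pick the action maximizing q[a] + c * sqrt(ln(n_node) / n_a[a]).

    Untried actions are returned first so every arm is sampled at least once.
    """
    for a in actions:
        if n_a[a] == 0:
            return a
    return max(actions,
               key=lambda a: q[a] + c * math.sqrt(math.log(n_node) / n_a[a]))
```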
3.2 Heuristic Methods

We use three heuristic methods. In the first, only the agent's level of criticality is considered (we call this "sickest first"). In the second, we use the reported regret values but run only a single round of the auction-based allocation, so only one agent receives a resource at each time step: the agent with the largest regret for not getting it. In the third, patients are treated in the order they arrive (first-come, first-served, or FCFS, a traditional healthcare method). Simple sketches of these rules are given below.
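The three heuristics reduce to simple selection rules, as in this minimal illustration (the attribute names criticality and arrival_time are hypothetical):

```python
def sickest_first(patients):
    """Serve patients in decreasing order of criticality."""
    return sorted(patients, key=lambda p: p.criticality, reverse=True)

def fcfs(patients):
    """First-come, first-served: serve patients in order of arrival."""
    return sorted(patients, key=lambda p: p.arrival_time)

def single_round_auction(agents, resources, expected_regret):
    """One auction round only: the single agent with the largest regret
    for its best resource is served in this time step."""
    best = {a: max(resources, key=lambda r: expected_regret(a, r))
            for a in agents}
    winner = max(agents, key=lambda a: expected_regret(a, best[a]))
    return winner, best[winner]
```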
4. EXPERIMENTS AND RESULTS
We demonstrate our approach in simulations with realistic probabilistic models of different conditions (e.g., diseases) and of health and resource dynamics. The simulations use a random sampling of agent MDPs, drawn from a realistic prior distribution over these models. It is important to note that we are not simply defining a single patient MDP; rather, our results are averages over randomly drawn MDPs: each simulated patient is different in each simulation, but drawn from the same underlying distribution.

We make three main assumptions. First, we assume that task durations are identical (e.g., it always takes one unit of time to consume each resource). Second, each agent is only able to bid on a single resource at each bidding round (but each bidding round includes a sequence of bids to determine the action for each MDP). Third, all patients arrive at the same time.
We assume that the health variable H ∈ {healthy, sick, critical}, and each resource variable R_i ∈ {have, had, need}. Patients all start (enter the hospital) with H = sick and, depending on the resources they acquire, their health state improves to healthy or degrades to the critical condition. We further define a function ν(h) that encodes the states of the health variable as integers for h = {healthy, sick, critical}. We assume that there are D possible conditions (diseases), each with a criticality level, a real number c_d ∈ [1, 2], with c_d = 2 being the most critical (such a disease makes the patient become sicker faster).

We first assume a multinomial distribution over the D conditions drawn from a set D, such that each patient has condition d ∈ D with probability φ_d(d). In the following, we assume conditions to be evenly distributed, φ_d(d) = 1/|D|, although in practice this distribution would reflect the current distribution of conditions in the population, community, or hospital. Each condition has a condition profile that specifies a set of resources in a specific order, derived from clinical practice guidelines or the medical pathway; a distribution over health state progression models, Ω_H; and a distribution over resource obtention models, Λ_R.

The medical pathway can be specified either within Ω_H (by making any set of r not on the pathway lead to non-progression of the health state) or within Λ_R (by making it impossible to obtain resource allocations outside the pathway). We choose the latter in these experiments, but in practice the pathway may need to be specified by a combination of both, particularly if there is non-determinism in the pathways (i.e., different pathways can be chosen, with different predicted outcomes). We assume that the pathways for all agents are a linear chain through the required resources for each condition.

For our experiments, we have built priors over Ω_H and Λ_R based on our prior knowledge of the health domain. We have made these priors reasonably realistic (they capture some of the main properties of this domain), yet sufficiently non-specific to allow a wide range of randomly drawn transition functions in the patient MDPs. In practice, these priors would be elicited from experts or learned from data.

Health state progression model: For each simulated agent, Ω_H is drawn from a Dirichlet prior distribution over the three values of H' that puts more mass on the probability of healthier states (compared to the current health state) if the required resources are obtained, but more mass on the probability of sicker states if the disease is more critical. More precisely, define ω_H ∼ Dir(α_H(d, r)), where α_H is a triple of values over H = {healthy, sick, critical} and |ω_H| = 1. If every required resource in r is had, or each is either had or have, then α_H(d, r) = (12, c_d, c_d); if all or some of the required resources are still needed, then α_H(d, r) = (4, c_d, c_d). For sampling purposes, we then use these Dirichlet draws as the parameters of multinomial distributions governing the progression of the health state, and we assume a similar progression over health states for all possible transitions based on ω_H. Thus, for each current health state h ∈ {sick, healthy, critical},

Ω_H ≡ P(h' | h, r) = (ω_{H,1}, ω_{H,2}, ω_{H,3})

over h' = (healthy, sick, critical), where ω_{H,i} is the i-th element of ω_H.
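To make the sampling concrete, here is a minimal NumPy sketch (our illustration; the function names are ours, and the α triples mirror the two cases above without distinguishing the partial-resource sub-cases). Λ_R would be drawn analogously from Dir(α_r):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def alpha_health(c_d, resources_obtained):
    """Dirichlet parameters over (healthy, sick, critical), following the
    cases in the text: criticality c_d in [1, 2] weights the sicker states."""
    return (np.array([12.0, c_d, c_d]) if resources_obtained
            else np.array([4.0, c_d, c_d]))

def sample_omega_H(c_d, resources_obtained):
    """Draw one agent's health progression row omega_H ~ Dir(alpha_H)."""
    return rng.dirichlet(alpha_health(c_d, resources_obtained))

def step_health(omega_H):
    """Sample the next health state from the multinomial given by omega_H."""
    return rng.choice(["healthy", "sick", "critical"], p=omega_H)

omega = sample_omega_H(c_d=1.5, resources_obtained=False)
print(omega, step_health(omega))
```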
Resource obtention model: For each simulated agent, Λ_R is drawn from a Dirichlet prior distribution over the three values of R' that puts more mass on the probability of getting a resource if it is the next one in the medical pathway and if the patient is sicker (so their regret, and hence their bids, will be larger, making it more likely they will get the resource). However, the probability mass shifts towards not getting a resource as N grows (the more agents in the system, the less likely any one agent is to get a resource). Recall from above that this model is meant to summarize the joint actions of the N other agents, as would be modeled in a full dec-POMDP solution. An adequate summary is important for good performance, and while we do not claim that the following prior is optimal, we believe it to be a good representation for these simulations. Ideally, this function would be computed from the complete model directly, or learned from data. We define Λ_R ∼ Dir(α_r(N, h, r)), where α_r is a triple of values over R = {have, had, need}, and ν'(h) is a numeric encoding of health for h = (healthy, sick, critical), with ν'(healthy) = 1 and larger values for sicker states. If all resources in r are either had or have, then α_r = (10ν'(h), ν'(h), N); if the previous resource in the medical pathway is still needed, or all resources are needed, then α_r = (ν'(h), ν'(h), N).

Reward function: Φ(h, h') is fixed for all the agents; it rewards agents for becoming healthy, but penalizes them for staying sick or moving to the critical state. More precisely, for h' = (healthy, sick, critical), the reward for reaching the healthy state is 10 from healthy, 15 from sick, and 5 from critical, while transitions to the sick and critical states carry zero or negative rewards. Further, once a patient is healthy and has received all resources, they are discharged and receive no further reward.

We ran each of the benchmarks on a machine with a 3.4GHz quad-core AMD processor and 4GB of RAM. We compare our auction-based coordinated MDP approach with (AucMDP-RegIter) and without (AucMDP-Reg) iteration, using the expected regret bidding mechanism. We also compare to a version in which agents bid their expected values rather than their regrets (AucMDP-Iter), to FCFS, to sickest-first, and to the sample-based method (UCT). Each simulated patient is randomly assigned a condition profile, and then an MDP model whose parameters are randomly drawn from the Dirichlet distributions defined above.
Figure 2: Evaluation of various approaches based on expected regret (AucMDP-Reg), expected value with iteration (AucMDP-Iter), expected regret with iteration (AucMDP-RegIter), and UCT, with R = 4, D = 4. (a) Timeout is 300 seconds, τ = 10N. (b) Timeout is 120 seconds, τ = 10N.
Figure 3: (a) Scaling to 30 agents; UCT with a 10-minute timeout and τ = 20, R = 4, D = 4. (b) Increasing the number of required resources (actions); UCT with a 60-second timeout and N = 6.

100 trials are run for each randomly drawn set of conditions and MDPs, and this is repeated 10 times; the UCT results are similarly averaged over repeated trials. We present means and standard deviations over these simulations. We first present results with 4 total resource types, where each agent requires 4 resources based on randomly assigned condition profiles (Figure 2a). The y-axis is the average reward per patient gathered over an entire trial. We use a horizon that depends on the number of agents (τ = 10N), and UCT is given a 300-second timeout. The total computation time of the complete allocations for the AucMDP approach is less than 10 seconds for problems with 10 agents, and this computation time increases linearly with the number of agents and resources (as opposed to the exponential growth in the MMDP case). The two iterative AucMDP approaches perform similarly, and outperform the heuristic approaches as N grows. UCT is given sufficient time to outperform all other approaches.

Figure 2b shows the performance of our approach in a more realistic scenario, with the timeout set to a maximum of 120 seconds for rollouts; again, each agent requires 4 resources. When the number of agents increases beyond 8, UCT underperforms relative to AucMDP, providing a policy as poor as FCFS or sickest-first. This is mostly because the number of possible joint actions grows exponentially as more agents are added, so UCT requires significantly more rollouts in the action exploration phase. Figure 3a shows further scaling to N = 30, again showing that our AucMDP approach outperforms the other methods on the larger problems. The number of joint actions also grows exponentially when the number of resources required by each agent is increased, since there are more individual options, but AucMDP handles this well thanks to the linear growth in its number of actions (Figure 3b).

As more resources are added to the system, the performance of approaches such as FCFS and sickest-first gets closer to ours, because more diverse sets of resources are defined by the condition profiles. Figure 4a shows that introducing more resource types yields more diversity in resource requirements, making the allocation problem "easier" to solve (fewer conflicts of interest); conversely, a smaller number of resource types results in harder allocation problems. Figure 4b shows the results of further scaling our AucMDP approach to 50 agents, each requiring 10 resources, with 10 condition profiles.
Figure 4: (a) Varying the total number of resource types, with R = 20, D = 5, N = 10: more diversity in resource requirements results in fewer resource conflicts. (b) Scaling our auction-based coordination approach to N = 50, R = 10, D = 10: comparison with heuristic methods traditionally practiced in healthcare.
5. RELATED WORK AND CONCLUSION
Our approach to coordinating MDPs contrasts with those of multiagent MDPs [5] and dec-MDPs [9] that seek exact solutions, which face complexity problems for large-scale settings such as ours [3]. Instead, we offer an approximation method that collapses the state space of each agent down to only the features that are available locally, and uses the averaged effects of other agents for coordination. This is similar in spirit to [4], where the effects of actions are estimated by agents (but without the central coordination present in our work).

Our approach to resource allocation assumes additive utility independence, as in [13], and has state and action spaces decomposed into sets of features, with each feature relevant to only one subtask, but for cooperative settings, to maximize global utility. The use of auctions to coordinate local preferences through MDPs is also proposed in [8], where individual MDPs are submitted to a central decision maker that solves the winner determination problem through a mixed integer linear program (MILP). However, this model only provides one-shot allocations and is not applicable to environments with dynamic agents or resources. Multiple allocation phases are addressed in [20], but the solution incurs greater communication overhead, with full agent preferences being modeled. Both approaches require a full preference model of all agents and their MDPs to be submitted to the auctioneer, which increases the computational effort on the auctioneer's side (it must solve an MMDP) and requires complicated (and often large) communication overhead, while raising privacy concerns. The work of [12] also addresses cooperative scenarios using auctions for allocating tasks to agents with fixed types and no individual preference models. In contrast, we employ a multi-round mechanism to assign multiple resources to dynamic agents, with expected regret dictating winner determination.

The problem of medical resource allocation is perhaps best addressed to date by [17, 18], which also integrates a health-based utility function to address fairness based on the severity of health states. That model does not, however, consider temporal dependency when determining allocations; our approach of considering future events provides a broader treatment of possible uncertainty. Markov decision processes have also been used to model elective (non-emergency) patient scheduling in [15].

In all, our auction-based MDP approach addresses dynamic allocation of resources using multiagent stochastic planning, employing an auction mechanism to converge quickly with low communication cost. Our experiments demonstrate its effectiveness in achieving global utility, using regret, for large-scale medical applications. Future work includes exploring auction-coordinated POMDPs [4] to estimate resource demands, and learning resource models from data. We are also interested in studying combinatorial bidding mechanisms [7, 19] and bidding languages [14] in order to optimize allocations based on richer preferences. Online mechanisms and dynamic auctions [16] may also be of value, to continue to explore changing environments.
6. ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their helpful comments.
7. REFERENCES

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[2] R.E. Bellman. Dynamic Programming. Courier Dover Publications, 2003.

[3] D.S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[4] A. Beynier and A.-I. Mouaddib. An iterative algorithm for solving constrained decentralized Markov decision processes. In Proceedings of AAAI, 2006.

[5] C. Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, pages 478–485, 1999.

[6] D.B. Chalfin, S. Trzeciak, A. Likourezos, B.M. Baumann, R.P. Dellinger, et al. Impact of delayed transfer of critically ill patients from the emergency department to the intensive care unit. Critical Care Medicine, 35(6):1477–1483, 2007.

[7] P. Cramton, Y. Shoham, and R. Steinberg. Introduction to Combinatorial Auctions. MIT Press, 2006.

[8] D.A. Dolgov and E.H. Durfee. Resource allocation among agents with MDP-induced preferences. Journal of Artificial Intelligence Research, 27(1):505–549, 2006.

[9] C.V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22(1):143–174, 2004.

[10] T. Keller and P. Eyerich. PROST: Probabilistic planning based on UCT. In Proceedings of ICAPS, 2012.

[11] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293, 2006.

[12] S. Koenig, C. Tovey, X. Zheng, and I. Sungur. Sequential bundle-bid single-sale auction algorithms for decentralized control. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1359–1365, 2007.

[13] N. Meuleau, M. Hauskrecht, K.-E. Kim, L. Peshkin, L.P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov decision processes. In Proceedings of AAAI, pages 165–172, 1998.

[14] N. Nisan. Bidding and allocation in combinatorial auctions. In Proceedings of the 2nd ACM Conference on Electronic Commerce, pages 1–12. ACM, 2000.

[15] L.G.N. Nunes, S.V. de Carvalho, and R.C.M. Rodrigues. Markov decision process applied to the control of hospital elective admissions. Artificial Intelligence in Medicine, 47(2):159–171, 2009.

[16] D.C. Parkes. Online mechanisms. In Algorithmic Game Theory (N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, eds.), pages 411–439, 2007.

[17] T.O. Paulussen, N.R. Jennings, K.S. Decker, and A. Heinzl. Distributed patient scheduling in hospitals. In International Joint Conference on Artificial Intelligence, volume 18, pages 1224–1232, 2003.

[18] T.O. Paulussen, A. Zoller, F. Rothlauf, A. Heinzl, L. Braubach, A. Pokahr, and W. Lamersdorf. Agent-based patient scheduling in hospitals. In Multiagent Engineering, pages 255–275, 2006.

[19] S.J. Rassenti, V.L. Smith, and R.L. Bulfin. A combinatorial auction mechanism for airport time slot allocation. The Bell Journal of Economics, pages 402–417, 1982.

[20] J. Wu and E.H. Durfee. Sequential resource allocation in multiagent systems with uncertainties. In Proceedings of AAMAS, 2007.