Journal of Artificial Intelligence Research 24 (2005) 49-79. Submitted 09/04; published 07/05
A Framework for Sequential Planning in Multi-Agent Settings
Piotr J. Gmytrasiewicz
piotr@cs.uic.edu
Prashant Doshi
pdoshi@cs.uic.edu
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan St, Chicago, IL 60607
Abstract
This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space. Agents maintain beliefs over physical states of the environment and over models of other agents, and they use Bayesian updates to maintain their beliefs over time. The solutions map belief states to actions. Models of other agents may include their belief states and are related to agent types considered in games of incomplete information. We express the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents. We show that important properties of POMDPs, such as convergence of value iteration, the rate of convergence, and piece-wise linearity and convexity of the value functions, carry over to our framework. Our approach complements a more traditional approach to interactive settings which uses Nash equilibria as a solution paradigm. We seek to avoid some of the drawbacks of equilibria which may be non-unique and do not capture off-equilibrium behaviors. We do so at the cost of having to represent, process and continuously revise models of other agents. Since the agent's beliefs may be arbitrarily nested, the optimal solutions to decision making problems are only asymptotically computable. However, approximate belief updates and approximately optimal plans are computable. We illustrate our framework using a simple application domain, and we show examples of belief updates and value functions.
1. Introduction
We develop a framework for sequential rationality of autonomous agents interacting with other agents within a common, and possibly uncertain, environment. We use the normative paradigm of decision-theoretic planning under uncertainty formalized as partially observable Markov decision processes (POMDPs) (Boutilier, Dean, & Hanks, 1999; Kaelbling, Littman, & Cassandra, 1998; Russell & Norvig, 2003) as a point of departure. Solutions of POMDPs are mappings from an agent's beliefs to actions. The drawback of POMDPs when it comes to environments populated by other agents is that other agents' actions have to be represented implicitly as environmental noise within the, usually static, transition model. Thus, an agent's beliefs about another agent are not part of solutions to POMDPs.

The main idea behind our formalism, called interactive POMDPs (I-POMDPs), is to allow agents to use more sophisticated constructs to model and predict behavior of other agents. Thus, we replace "flat" beliefs about the state space used in POMDPs with beliefs about the physical environment and about the other agent(s), possibly in terms of their preferences, capabilities, and beliefs. Such beliefs could include others' beliefs about others, and thus can be nested to arbitrary levels. They are called interactive beliefs. While the space of interactive beliefs is very rich and updating these beliefs is more complex than updating their "flat" counterparts, we use the value
function plots to show that solutions to I-POMDPs are at least as good as, and in usual cases superior to, comparable solutions to POMDPs. The reason is intuitive – maintaining sophisticated models of other agents allows more refined analysis of their behavior and better predictions of their actions.

I-POMDPs are applicable to autonomous self-interested agents who locally compute what actions they should execute to optimize their preferences given what they believe while interacting with others with possibly conflicting objectives. Our approach of using a decision-theoretic framework and solution concept complements the equilibrium approach to analyzing interactions as used in classical game theory (Fudenberg & Tirole, 1991). The drawback of equilibria is that there could be many of them (non-uniqueness), and that they describe agents' optimal actions only if, and when, an equilibrium has been reached (incompleteness). Our approach, instead, is centered on optimality and best response to anticipated actions of other agent(s), rather than on stability (Binmore, 1990; Kadane & Larkey, 1982). The question of whether, under what circumstances, and what kind of equilibria could arise from solutions to I-POMDPs is currently open.

Our approach avoids the difficulties of non-uniqueness and incompleteness of the traditional equilibrium approach, and offers solutions which are likely to be better than the solutions of traditional POMDPs applied to multi-agent settings. But these advantages come at the cost of processing and maintaining possibly infinitely nested interactive beliefs. Consequently, only approximate belief updates and approximately optimal solutions to planning problems are computable in general. We define a class of finitely nested I-POMDPs to form a basis for computable approximations to infinitely nested ones. We show that a number of properties that facilitate solutions of POMDPs carry over to finitely nested I-POMDPs. In particular, the interactive beliefs are sufficient statistics for the histories of the agent's observations, the belief update is a generalization of the update in POMDPs, the value function is piece-wise linear and convex, and the value iteration algorithm converges at the same rate.

The remainder of this paper is structured as follows. We start with a brief review of related work in Section 2, followed by an overview of partially observable Markov decision processes in Section 3. There, we include a simple example of a tiger game. We introduce the concept of agent types in Section 4. Section 5 introduces interactive POMDPs and defines their solutions. The finitely nested I-POMDPs, and some of their properties, are introduced in Section 6. We continue with an example application of finitely nested I-POMDPs to a multi-agent version of the tiger game in Section 7. There, we show examples of belief updates and value functions. We conclude with a brief summary and some current research issues in Section 8. Details of all proofs are in the Appendix.
2. Related Work
Our work draws from prior research on partially observable Markov decision processes, which recently gained a lot of attention within the AI community (Smallwood & Sondik, 1973; Monahan, 1982; Lovejoy, 1991; Hauskrecht, 1997; Kaelbling et al., 1998; Boutilier et al., 1999; Hauskrecht, 2000).

The formalism of Markov decision processes has been extended to multiple agents, giving rise to stochastic games or Markov games (Fudenberg & Tirole, 1991). Traditionally, the solution concept used for stochastic games is that of Nash equilibria. Some recent work in AI follows that tradition (Littman, 1994; Hu & Wellman, 1998; Boutilier, 1999; Koller & Milch, 2001). However, as we mentioned before, and as has been pointed out by some game theorists (Binmore, 1990; Kadane &
Larkey, 1982), while Nash equilibria are useful for describing a multi-agent system when, and if, it has reached a stable state, this solution concept is not sufficient as a general control paradigm. The main reasons are that there may be multiple equilibria with no clear way to choose among them (non-uniqueness), and the fact that equilibria do not specify actions in cases in which agents believe that other agents may not act according to their equilibrium strategies (incompleteness).

Other extensions of POMDPs to multiple agents have appeared in the AI literature recently (Bernstein, Givan, Immerman, & Zilberstein, 2002; Nair, Pynadath, Yokoo, Tambe, & Marsella, 2003). They have been called decentralized POMDPs (DEC-POMDPs), and are related to decentralized control problems (Ooi & Wornell, 1996). The DEC-POMDP framework assumes that the agents are fully cooperative, i.e., they have a common reward function and form a team. Furthermore, it is assumed that the optimal joint solution is computed centrally and then distributed among the agents for execution.

From the game-theoretic side, we are motivated by the subjective approach to probability in games (Kadane & Larkey, 1982), Bayesian games of incomplete information (see Fudenberg & Tirole, 1991; Harsanyi, 1967, and references therein), work on interactive belief systems (Harsanyi, 1967; Mertens & Zamir, 1985; Brandenburger & Dekel, 1993; Fagin, Halpern, Moses, & Vardi, 1995; Aumann, 1999; Fagin, Geanakoplos, Halpern, & Vardi, 1999), and insights from research on learning in game theory (Fudenberg & Levine, 1998). Our approach, closely related to the decision-theoretic (Myerson, 1991), or epistemic (Armbruster & Böge, 1979; Battigalli & Siniscalchi, 1999; Brandenburger, 2002), approach to game theory, consists of predicting actions of other agents given all available information, and then of choosing the agent's own action (Kadane & Larkey, 1982). Thus, the descriptive aspect of decision theory is used to predict others' actions, and its prescriptive aspect is used to select the agent's own optimal action.

The work presented here also extends previous work on the Recursive Modeling Method (RMM) (Gmytrasiewicz & Durfee, 2000), but adds elements of belief update and sequential planning.
3. Background: Partially Observable Markov Decision Processes
A partially observable Markov decision process (POMDP) (Monahan, 1982; Hauskrecht, 1997; Kaelbling et al., 1998; Boutilier et al., 1999; Hauskrecht, 2000) of an agent $i$ is defined as

$$\text{POMDP}_i = \langle S, A_i, T_i, \Omega_i, O_i, R_i \rangle \quad (1)$$

where: $S$ is a set of possible states of the environment. $A_i$ is a set of actions agent $i$ can execute. $T_i$ is a transition function, $T_i : S \times A_i \times S \rightarrow [0,1]$, which describes the results of agent $i$'s actions. $\Omega_i$ is the set of observations the agent $i$ can make. $O_i$ is the agent's observation function, $O_i : S \times A_i \times \Omega_i \rightarrow [0,1]$, which specifies probabilities of observations given the agent's actions and resulting states. Finally, $R_i$ is the reward function representing the agent $i$'s preferences: $R_i : S \times A_i \rightarrow \Re$.

In POMDPs, an agent's belief about the state is represented as a probability distribution over $S$. Initially, before any observations or actions take place, the agent has some (prior) belief, $b_i^0$. After some time steps, $t$, we assume that the agent has $t+1$ observations and has performed $t$ actions. These can be assembled into agent $i$'s observation history: $h_i^t = \{o_i^0, o_i^1, \ldots, o_i^{t-1}, o_i^t\}$ at time $t$. Let $H_i$ denote the set of all observation histories of agent $i$. The agent's current belief over $S$, $b_i^t$, is continuously revised based on new observations and expected results of performed actions.
1. We assume that an action is taken at every time step; this is without loss of generality since any of the actions may be a No-op.
It turns out that the agent's belief state is sufficient to summarize all of the past observation history and initial belief; hence it is called a sufficient statistic. The belief update takes into account changes in the initial belief, $b_i^{t-1}$, due to the action, $a_i^{t-1}$, executed at time $t-1$, and the new observation, $o_i^t$. The new belief, $b_i^t$, that the current state is $s^t$, is:

$$b_i^t(s^t) = \beta \, O_i(s^t, a_i^{t-1}, o_i^t) \sum_{s^{t-1} \in S} b_i^{t-1}(s^{t-1}) \, T_i(s^{t-1}, a_i^{t-1}, s^t) \quad (2)$$

where $\beta$ is the normalizing constant. It is convenient to summarize the above update, performed for all states in $S$, as $b_i^t = SE(b_i^{t-1}, a_i^{t-1}, o_i^t)$ (Kaelbling et al., 1998).

The agent's optimality criterion, $OC_i$, is needed to specify how rewards acquired over time are handled. Commonly used criteria include:

• A finite horizon criterion, in which the agent maximizes the expected value of the sum of the following $T$ rewards: $E(\sum_{t=0}^{T} r_t)$. Here, $r_t$ is a reward obtained at time $t$ and $T$ is the length of the horizon. We will denote this criterion as $fh_T$.

• An infinite horizon criterion with discounting, according to which the agent maximizes $E(\sum_{t=0}^{\infty} \gamma^t r_t)$, where $0 < \gamma < 1$ is a discount factor. We will denote this criterion as $ih_\gamma$.

• An infinite horizon criterion with averaging, according to which the agent maximizes the average reward per time step. We will denote this as $ih_{AV}$.

In what follows, we concentrate on the infinite horizon criterion with discounting, but our approach can be easily adapted to the other criteria.

The utility associated with a belief state, $b_i$, is composed of the best of the immediate rewards that can be obtained in $b_i$, together with the discounted expected sum of utilities associated with belief states following $b_i$:

$$U(b_i) = \max_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R_i(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U(SE(b_i, a_i, o_i)) \Big\} \quad (3)$$

Value iteration uses Equation 3 iteratively to obtain values of belief states for longer time horizons. At each step of the value iteration the error of the current value estimate is reduced by the factor of at least $\gamma$ (see, for example, Russell & Norvig, 2003, Section 17.2). The optimal action, $a_i^*$, is then an element of the set of optimal actions, $OPT(b_i)$, for the belief state, defined as:

$$OPT(b_i) = \operatorname*{argmax}_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R_i(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U(SE(b_i, a_i, o_i)) \Big\} \quad (4)$$
2. See (Smallwood & Sondik, 1973) for proof.
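To make Eq. 2 concrete, the following is a minimal sketch of the belief update $SE(b, a, o)$ for a finite POMDP. The dictionary-based transition and observation functions are illustrative stand-ins for $T_i$ and $O_i$; the tiger-game numbers used in the usage example (listening keeps the tiger in place, growls are 85% accurate) are taken from the example discussed below.

```python
# A minimal sketch of the POMDP belief update SE(b, a, o) from Eq. 2.
# T[(s_prev, a)][s] = Pr(s | s_prev, a); O[(s, a)][o] = Pr(o | s, a).
# The concrete dictionaries below are illustrative only.

def belief_update(b, a, o, states, T, O):
    """Return b' with b'(s) proportional to
    O(s, a, o) * sum_{s_prev} b(s_prev) * T(s_prev, a, s)."""
    new_b = {}
    for s in states:
        predicted = sum(b[s_prev] * T[(s_prev, a)][s] for s_prev in states)
        new_b[s] = O[(s, a)][o] * predicted
    norm = sum(new_b.values())          # the constant 1/beta in Eq. 2
    return {s: p / norm for s, p in new_b.items()}

# Usage with (assumed) single-agent tiger parameters: listening leaves the
# tiger in place and yields the correct growl with probability 0.85.
states = ["TL", "TR"]
T = {("TL", "L"): {"TL": 1.0, "TR": 0.0}, ("TR", "L"): {"TL": 0.0, "TR": 1.0}}
O = {("TL", "L"): {"GL": 0.85, "GR": 0.15}, ("TR", "L"): {"GL": 0.15, "GR": 0.85}}
b0 = {"TL": 0.5, "TR": 0.5}
print(belief_update(b0, "L", "GL", states, T, O))   # {'TL': 0.85, 'TR': 0.15}
```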
Figure 1: The value function for the single agent tiger game with time horizon of length 1, $OC_i = fh_1$. Actions are: open right door (OR), open left door (OL), and listen (L). For this value of the time horizon the value function for a POMDP with a noise factor is identical to the single agent POMDP.

We briefly review the POMDP solutions to the tiger game (Kaelbling et al., 1998). Our purpose is to build on the insights that POMDP solutions provide in this simple case to illustrate solutions to interactive versions of this game later.

The traditional tiger game resembles a game-show situation in which the decision maker has to choose to open one of two doors behind which lies either a valuable prize or a dangerous tiger. Apart from actions that open doors, the subject has the option of listening for the tiger's growl coming from the left, or the right, door. However, the subject's hearing is imperfect, with given percentages (say, 15%) of false positive and false negative occurrences. Following Kaelbling et al. (1998), we assume that the value of the prize is 10, that the pain associated with encountering the tiger can be quantified as -100, and that the cost of listening is -1.

The value function, in Figure 1, shows values of various belief states when the agent's time horizon is equal to 1. Values of beliefs are based on the best action available in that belief state, as specified in Eq. 3. The state of certainty is most valuable – when the agent knows the location of the tiger it can open the opposite door and claim the prize which certainly awaits. Thus, when the probability of the tiger location is 0 or 1, the value is 10. When the agent is sufficiently uncertain, its best option is to play it safe and listen; the value is then -1. The agent is indifferent between opening doors and listening when it assigns probabilities of 0.9 or 0.1 to the location of the tiger.

Note that, when the time horizon is equal to 1, listening does not provide any useful information since the game does not continue to allow for the use of this information. For longer time horizons the benefits of listening result in policies which are better in some ranges of initial belief. Since the value function is composed of values corresponding to actions, which are linear in the probability of the tiger location, the value function has the property of being piece-wise linear and convex (PWLC) for all horizons. This simplifies the computations substantially.

Figure 2: The value function for the single agent tiger game compared to an agent facing a noise factor, for horizon of length 2. Policies corresponding to value lines are conditional plans. Actions, L, OR or OL, are conditioned on observation sequences in parentheses. For example, L\();L\(GL),OL\(GR) denotes a plan to perform the listening action, L, at the beginning (the list of observations is empty), then another L if the observation is growl from the left (GL), and open the left door, OL, if the observation is GR. ∗ is a wildcard with the usual interpretation.

In Figure 2 we present a comparison of value functions for horizon of length 2 for a single agent, and for an agent facing a more noisy environment. The presence of such noise could be due to another agent opening the doors or listening with some probabilities. Since POMDPs do not include explicit models of other agents, these noise actions have been included in the transition model, T.

Consequences of folding noise into T are two-fold. First, the effectiveness of the agent's optimal policies declines since the value of hearing growls diminishes over many time steps. Figure 3 depicts a comparison of value functions for horizon of length 3. Here, for example, two consecutive growls in a noisy environment are not as valuable as when the agent knows it is acting alone, since the noise may have perturbed the state of the system between the growls. For time horizon of length 1 the noise does not matter and the value vectors overlap, as in Figure 1.

Second, since the presence of another agent is implicit in the static transition model, the agent cannot update its model of the other agent's actions during repeated interactions. This effect becomes more important as the time horizon increases. Our approach addresses this issue by allowing explicit modeling of the other agent(s). This results in policies of superior quality, as we show in Section 7. Figure 4 shows a policy for an agent facing a noisy environment for time horizon of 3. We compare it to the corresponding I-POMDP policy in Section 7. Note that it is slightly different
3. We assumed that, due to the noise, either door opens with probabilities of 0.1 at each turn, and nothing happens with the probability 0.8. We explain the origin of this assumption in Section 7.
Figure 3: The value function for the single agent tiger game compared to an agent facing a noise factor, for horizon of length 3. The "?" in the description of a policy stands for any of the perceptual sequences not yet listed in the description of the policy.
Figure 4: The policy graph corresponding to the value function of the POMDP with noise depicted in Fig. 3.
than the policy without noise in the example by Kaelbling, Littman and Cassandra (1998) due to differences in value functions.
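As a minimal sketch of the noise-folding just described, the function below marginalizes the other agent's behavior into a single-agent transition model, using the probabilities from footnote 3 (each door opened with probability 0.1, listening with probability 0.8). The code structure and parameter names are illustrative, not taken from the paper.

```python
# A sketch of folding the other agent's behavior into the transition model T,
# as done for the "POMDP with noise" curves. Noise probabilities follow
# footnote 3; everything else is illustrative.

def noisy_transition(s_prev, a_i, s, p_open_left=0.1, p_open_right=0.1):
    """Pr(s | s_prev, a_i) after marginalizing out the other agent's action.
    Any door opening (by either agent) resets the tiger location to uniform."""
    if a_i != "L":                       # i opened a door: location is reset
        return 0.5
    p_listen = 1.0 - p_open_left - p_open_right
    stay = 1.0 if s == s_prev else 0.0   # if the other agent listened, tiger stays
    return p_listen * stay + (p_open_left + p_open_right) * 0.5

# Under (L, noise) the tiger stays where it was with probability 0.9:
print(noisy_transition("TL", "L", "TL"))   # 0.8*1 + 0.2*0.5 = 0.9
print(noisy_transition("TL", "L", "TR"))   # 0.8*0 + 0.2*0.5 = 0.1
```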
4. Agent Types and Frames
The POMDP definition includes parameters that permit us to compute an agent's optimal behavior, conditioned on its beliefs. Let us collect these implementation-independent factors into a construct we call an agent $i$'s type.

Definition 1 (Type).
A type of an agent $i$ is $\theta_i = \langle b_i, A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$, where $b_i$ is agent $i$'s state of belief (an element of $\Delta(S)$), $OC_i$ is its optimality criterion, and the rest of the elements are as defined before. Let $\Theta_i$ be the set of agent $i$'s types.

Given a type, $\theta_i$, and the assumption that the agent is Bayesian-rational, the set of the agent's optimal actions will be denoted as $OPT(\theta_i)$. In the next section, we generalize the notion of type to situations which include interactions with other agents; it then coincides with the notion of type used in Bayesian games (Fudenberg & Tirole, 1991; Harsanyi, 1967). It is convenient to define the notion of a frame, $\hat{\theta}_i$, of agent $i$:

Definition 2 (Frame).
A frame of an agent $i$ is $\hat{\theta}_i = \langle A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$. Let $\hat{\Theta}_i$ be the set of agent $i$'s frames.

For brevity one can write a type as consisting of an agent's belief together with its frame: $\theta_i = \langle b_i, \hat{\theta}_i \rangle$. In the context of the tiger game described in the previous section, an agent's type describes the agent's actions and their results, the quality of the agent's hearing, its payoffs, and its belief about the tiger location.

Realistically, apart from implementation-independent factors grouped in the type, an agent's behavior may also depend on implementation-specific parameters, like the processor speed, memory available, etc. These can be included in the (implementation dependent, or complete) type, increasing the accuracy of predicted behavior, but at the cost of additional complexity. Definition and use of complete types is a topic of ongoing work.
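A minimal sketch of Definitions 1 and 2 rendered as data structures; the field names are chosen for illustration only, since the paper specifies only the mathematical tuples.

```python
# Illustrative data structures for a frame and a type (Definitions 1 and 2).
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Frame:
    """theta_hat_i = <A_i, Omega_i, T_i, O_i, R_i, OC_i>."""
    actions: Any
    observations: Any
    transition: Any
    observation_fn: Any
    reward: Any
    optimality_criterion: Any

@dataclass
class Type:
    """theta_i = <b_i, theta_hat_i>: a belief over states plus a frame."""
    belief: Dict[str, float]   # an element of Delta(S)
    frame: Frame
```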
5. Interactive POMDPs
As we mentioned, our intention is to generalize POMDPs to handle the presence of other agents. We do this by including descriptions of other agents (their types, for example) in the state space. For simplicity of presentation, we consider an agent $i$ that is interacting with one other agent, $j$. The formalism easily generalizes to a larger number of agents.

Definition 3 (I-POMDP). An interactive POMDP of agent $i$, I-POMDP$_i$, is:

$$\text{I-POMDP}_i = \langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle \quad (5)$$
4. The issue of computability of solutions to POMDPs has been a subject of much research (Papadimitriou & Tsitsiklis, 1987; Madani, Hanks, & Condon, 2003). It is of obvious importance when one uses POMDPs to model agents; we return to this issue later.
where:

• $IS_i$ is a set of interactive states defined as $IS_i = S \times M_j$, where $S$ is the set of states of the physical environment, and $M_j$ is the set of possible models of agent $j$. Each model, $m_j \in M_j$, is defined as a triple $m_j = \langle h_j, f_j, O_j \rangle$, where $f_j : H_j \rightarrow \Delta(A_j)$ is agent $j$'s function, assumed computable, which maps possible histories of $j$'s observations to distributions over its actions, $h_j$ is an element of $H_j$, and $O_j$ is a function specifying the way the environment is supplying the agent with its input. Sometimes we write the model $m_j$ as $m_j = \langle h_j, \hat{m}_j \rangle$, where $\hat{m}_j$ consists of $f_j$ and $O_j$. It is convenient to subdivide the set of models into two classes. The subintentional models, $SM_j$, are relatively simple, while the intentional models, $IM_j$, use the notion of rationality to model the other agent. Thus, $M_j = IM_j \cup SM_j$.

Simple examples of subintentional models include a no-information model and a fictitious play model, both of which are history independent. A no-information model (Gmytrasiewicz & Durfee, 2000) assumes that each of the other agent's actions is executed with equal probability. Fictitious play (Fudenberg & Levine, 1998) assumes that the other agent chooses actions according to a fixed but unknown distribution, and that the original agent's prior belief over that distribution takes the form of a Dirichlet distribution. An example of a more powerful subintentional model is a finite state controller.

The intentional models are more sophisticated in that they ascribe to the other agent beliefs, preferences and rationality in action selection. Intentional models are thus $j$'s types, $\theta_j = \langle b_j, \hat{\theta}_j \rangle$, under the assumption that agent $j$ is Bayesian-rational. Agent $j$'s belief is a probability distribution over states of the environment and the models of the agent $i$: $b_j \in \Delta(S \times M_i)$. The notion of a type we use here coincides with the notion of type in game theory, where it is defined as consisting of all of the agent's private information relevant to its decision making (Harsanyi, 1967; Fudenberg & Tirole, 1991). In particular, if agents' beliefs are private information, then their types involve possibly infinitely nested beliefs over others' types and their beliefs about others (Mertens & Zamir, 1985; Brandenburger & Dekel, 1993; Aumann, 1999; Aumann & Heifetz, 2002). They are related to recursive model structures in our prior work (Gmytrasiewicz & Durfee, 2000). The definition of the interactive state space is consistent with the notion of a completely specified state space put forward by Aumann (1999). Similar state spaces have been proposed by others (Mertens & Zamir, 1985; Brandenburger & Dekel, 1993).

• $A = A_i \times A_j$ is the set of joint moves of all agents.

• $T_i$ is the transition model. The usual way to define the transition probabilities in POMDPs is to assume that the agent's actions can change any aspect of the state description. In the case of I-POMDPs, this would mean actions modifying any aspect of the interactive states, including other agents' observation histories and their functions, or, if they are modeled intentionally, their beliefs and reward functions. Allowing agents to directly manipulate other agents in such ways, however, violates the notion of agents' autonomy. Thus, we make the following simplifying assumption:
5. If there are more agents, say $N > 2$, then $IS_i = S \times \prod_{j=1}^{N-1} M_j$.
6. Technically, according to our notation, fictitious play is actually an ensemble of models.
7. Dennett (1986) advocates ascribing rationality to other agent(s), and calls it "assuming an intentional stance towards them".
8. Note that the space of types is by far richer than that of computable models. In particular, since the set of computable models is countable and the set of types is uncountable, many types are not computable models.
9. Implicit in the definition of interactive beliefs is the assumption of coherency (Brandenburger & Dekel, 1993).
Model Non-manipulability Assumption (MNM):
Agents' actions do not change the other agents' models directly.

Given this simplification, the transition model can be defined as $T_i : S \times A \times S \rightarrow [0,1]$. Autonomy, formalized by the MNM assumption, precludes, for example, direct "mind control", and implies that other agents' belief states can be changed only indirectly, typically by changing the environment in a way observable to them. In other words, agents' beliefs change, like in POMDPs, but as a result of belief update after an observation, not as a direct result of any of the agents' actions.

• $\Omega_i$ is defined as before in the POMDP model.

• $O_i$ is an observation function. In defining this function we make the following assumption:

Model Non-observability (MNO):
Agents cannot observe others' models directly.

Given this assumption the observation function is defined as $O_i : S \times A \times \Omega_i \rightarrow [0,1]$. The MNO assumption formalizes another aspect of autonomy – agents are autonomous in that their observations and functions, or beliefs and other properties, say preferences, in intentional models, are private and the other agents cannot observe them directly.

• $R_i$ is defined as $R_i : IS_i \times A \rightarrow \Re$. We allow the agent to have preferences over physical states and models of other agents, but usually only the physical state will matter.

As we mentioned, we see interactive POMDPs as a subjective counterpart to the objective external view taken in stochastic games (Fudenberg & Tirole, 1991), followed also in some work in AI (Boutilier, 1999; Koller & Milch, 2001) and in decentralized POMDPs (Bernstein et al., 2002; Nair et al., 2003). Interactive POMDPs represent an individual agent's point of view on the environment and the other agents, and facilitate planning and problem solving at the agent's own individual level.
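The sketch below mirrors the structure of the interactive state space $IS_i = S \times M_j$ and the two model classes defined above. The class and field names are illustrative assumptions; only the mathematical structure comes from the paper.

```python
# A structural sketch of IS_i = S x M_j with subintentional and intentional models.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union

@dataclass
class SubintentionalModel:
    """m_j = <h_j, f_j, O_j>: an observation history, a computable function from
    observation histories to distributions over actions, and an observation fn."""
    history: Tuple[str, ...]
    policy_fn: Callable[[Tuple[str, ...]], Dict[str, float]]
    observation_fn: Callable[[str, Tuple[str, str], str], float]

@dataclass
class IntentionalModel:
    """theta_j = <b_j, frame_j>: j's belief over S x M_i plus j's frame."""
    belief: Dict[object, float]
    frame: object

Model = Union[SubintentionalModel, IntentionalModel]

@dataclass(frozen=True)
class InteractiveState:
    """is = (s, m_j): a physical state paired with a model of agent j."""
    physical_state: str
    model_of_j: object

# A no-information subintentional model: every action equally likely.
def no_information(history):
    actions = ["OL", "OR", "L"]
    return {a: 1.0 / len(actions) for a in actions}
```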
5.1 Belief Update in I-POMDPs

We will show that, as in POMDPs, an agent's beliefs over its interactive states are sufficient statistics, i.e., they fully summarize the agent's observation histories. Further, we need to show how beliefs are updated after the agent's action and observation, and how solutions are defined.

The new belief state, $b_i^t$, is a function of the previous belief state, $b_i^{t-1}$, the last action, $a_i^{t-1}$, and the new observation, $o_i^t$, just as in POMDPs. There are two differences that complicate belief update when compared to POMDPs. First, since the state of the physical environment depends on the actions performed by both agents, the prediction of how the physical state changes has to be made based on the probabilities of various actions of the other agent. The probabilities of the other's actions are obtained from their models. Thus, unlike in Bayesian and stochastic games, we do not assume that actions are fully observable by other agents. Rather, agents can attempt to infer what actions other agents have performed by sensing their results on the environment. Second, changes in the models of other agents have to be included in the update. These reflect the other's observations and, if they are modeled intentionally, the update of the other agent's beliefs. In this case, the agent has to update its beliefs about the other agent based on what it anticipates the other agent observes
10. The possibility that agents can influence the observational capabilities of other agents can be accommodated by including the factors that can change sensing capabilities in the set S.
11. Again, the possibility that agents can observe factors that may influence the observational capabilities of other agents is allowed by including these factors in S.
and how it updates. As could be expected, the update of the possibly infinitely nested belief over the other's types is, in general, only asymptotically computable.
Proposition 1. (Sufficiency)
In an interactive POMDP of agent $i$, $i$'s current belief, i.e., the probability distribution over the set $S \times M_j$, is a sufficient statistic for the past history of $i$'s observations.

The next proposition defines the agent $i$'s belief update function, $b_i^t(is^t) = Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1})$, where $is^t \in IS_i$ is an interactive state. We use the belief state estimation function, $SE_{\theta_i}$, as an abbreviation for belief updates for individual states so that $b_i^t = SE_{\theta_i}(b_i^{t-1}, a_i^{t-1}, o_i^t)$. $\tau_{\theta_i}(b_i^{t-1}, a_i^{t-1}, o_i^t, b_i^t)$ will stand for $Pr(b_i^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t)$. Further below we also define the set of type-dependent optimal actions of an agent, $OPT(\theta_i)$.

Proposition 2. (Belief Update)
Under the MNM and MNO assumptions, the belief update function for an interactive POMDP $\langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle$, when $m_j$ in $is^t$ is intentional, is:

$$b_i^t(is^t) = \beta \sum_{is^{t-1}: \hat{m}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid \theta_j^{t-1}) \, O_i(s^t, a^{t-1}, o_i^t) \times T_i(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \, O_j(s^t, a^{t-1}, o_j^t) \quad (6)$$

When $m_j$ in $is^t$ is subintentional the first summation extends over $is^{t-1}: \hat{m}_j^{t-1} = \hat{m}_j^t$, $Pr(a_j^{t-1} \mid \theta_j^{t-1})$ is replaced with $Pr(a_j^{t-1} \mid m_j^{t-1})$, and $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ is replaced with the Kronecker delta function $\delta_K(\mathrm{APPEND}(h_j^{t-1}, o_j^t), h_j^t)$.

Above, $b_j^{t-1}$ and $b_j^t$ are the belief elements of $\theta_j^{t-1}$ and $\theta_j^t$, respectively, $\beta$ is a normalizing constant, and $Pr(a_j^{t-1} \mid \theta_j^{t-1})$ is the probability that $a_j^{t-1}$ is Bayesian rational for the agent described by type $\theta_j^{t-1}$. This probability is equal to $1/|OPT(\theta_j)|$ if $a_j^{t-1} \in OPT(\theta_j)$, and it is equal to zero otherwise. We define $OPT$ in Section 5.2. For the case of $j$'s subintentional model, $is = (s, m_j)$, $h_j^{t-1}$ and $h_j^t$ are the observation histories which are part of $m_j^{t-1}$ and $m_j^t$, respectively, $O_j$ is the observation function in $m_j^t$, and $Pr(a_j^{t-1} \mid m_j^{t-1})$ is the probability assigned by $m_j^{t-1}$ to $a_j^{t-1}$. APPEND returns a string with the second argument appended to the first. The proofs of the propositions are in the Appendix.

Proposition 2 and Eq. 6 have a lot in common with belief update in POMDPs, as should be expected. Both depend on agent $i$'s observation and transition functions. However, since agent $i$'s observations also depend on agent $j$'s actions, the probabilities of various actions of $j$ have to be included (in the first line of Eq. 6). Further, since the update of agent $j$'s model depends on what $j$ observes, the probabilities of various observations of $j$ have to be included (in the second line of Eq. 6). The update of $j$'s beliefs is represented by the $\tau_{\theta_j}$ term. The belief update can easily be generalized to the setting where more than one other agent co-exists with agent $i$.
12. If the agent's prior belief over $IS_i$ is given by a probability density function then the $\sum_{is^{t-1}}$ is replaced by an integral. In that case $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ takes the form of a Dirac delta function over the argument $b_j^{t-1}$: $\delta_D(SE_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t) - b_j^t)$.
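The following is a schematic sketch of the level-1, intentional-model case of Eq. 6 over a finite set of candidate interactive states. The helpers pr_action_j, tau_j, T_i, O_i and O_j stand for the quantities defined above and are assumed to be supplied by the caller; none of the names come from the paper.

```python
# A schematic sketch of Eq. 6 for intentional models of j, assuming a finite
# candidate set of interactive states (s, theta_j). Helper functions are
# placeholders for the quantities defined in the text.

def ipomdp_belief_update(b_prev, a_i, o_i, interactive_states, actions_j,
                         observations_j, pr_action_j, tau_j, T_i, O_i, O_j):
    """b_prev maps (s, theta_j) pairs to probabilities; returns the new belief."""
    new_b = {}
    for (s, theta_j) in interactive_states:                  # target is^t
        total = 0.0
        for (s_prev, theta_j_prev), p in b_prev.items():     # sum over is^{t-1}
            for a_j in actions_j:                             # sum over a_j^{t-1}
                pa = pr_action_j(a_j, theta_j_prev)           # Pr(a_j | theta_j^{t-1})
                if pa == 0.0:
                    continue
                trans = T_i(s_prev, (a_i, a_j), s)
                obs_i = O_i(s, (a_i, a_j), o_i)
                # sum over j's possible observations o_j^t, weighted by tau
                obs_term = sum(O_j(s, (a_i, a_j), o_j) *
                               tau_j(theta_j_prev, a_j, o_j, theta_j)
                               for o_j in observations_j)
                total += p * pa * trans * obs_i * obs_term
        new_b[(s, theta_j)] = total
    norm = sum(new_b.values())                                # 1/beta
    return {is_t: v / norm for is_t, v in new_b.items()}
```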
5.2 Value Function and Solutions in I-POMDPs

Analogously to POMDPs, each belief state in an I-POMDP has an associated value reflecting the maximum payoff the agent can expect in this belief state:

$$U(\theta_i) = \max_{a_i \in A_i} \Big\{ \sum_{is} ER_i(is, a_i) b_i(is) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \quad (7)$$

where $ER_i(is, a_i) = \sum_{a_j} R_i(is, a_i, a_j) Pr(a_j \mid m_j)$. Eq. 7 is a basis for value iteration in I-POMDPs.

Agent $i$'s optimal action, $a_i^*$, for the case of the infinite horizon criterion with discounting, is an element of the set of optimal actions for the belief state, $OPT(\theta_i)$, defined as:

$$OPT(\theta_i) = \operatorname*{argmax}_{a_i \in A_i} \Big\{ \sum_{is} ER_i(is, a_i) b_i(is) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i) \, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \quad (8)$$

As in the case of belief update, due to possibly infinitely nested beliefs, a step of value iteration and optimal actions are only asymptotically computable.
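A small sketch of the expected immediate reward $ER_i(is, a_i)$ used in Eqs. 7 and 8, which marginalizes $R_i$ over the other agent's predicted actions; the function and argument names are illustrative.

```python
# ER_i(is, a_i) = sum_{a_j} R_i(is, a_i, a_j) * Pr(a_j | m_j), the expected
# immediate reward of Eqs. 7 and 8. Names are illustrative.

def expected_reward(interactive_state, a_i, actions_j, R_i, pr_action_j):
    s, m_j = interactive_state
    return sum(R_i(interactive_state, a_i, a_j) * pr_action_j(a_j, m_j)
               for a_j in actions_j)

def immediate_term(b_i, a_i, actions_j, R_i, pr_action_j):
    """The first term of Eq. 7: sum_{is} ER_i(is, a_i) b_i(is)."""
    return sum(expected_reward(is_t, a_i, actions_j, R_i, pr_action_j) * p
               for is_t, p in b_i.items())
```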
6. Finitely Nested I-POMDPs
Possibly infinite nesting of agents' beliefs in intentional models presents an obvious obstacle to computing the belief updates and optimal solutions. Since the models of agents with infinitely nested beliefs correspond to agent functions which are not computable, it is natural to consider finite nestings. We follow approaches in game theory (Aumann, 1999; Brandenburger & Dekel, 1993; Fagin et al., 1999), extend our previous work (Gmytrasiewicz & Durfee, 2000), and construct finitely nested I-POMDPs bottom-up. Assume a set of physical states of the world $S$, and two agents $i$ and $j$. Agent $i$'s 0-th level beliefs, $b_{i,0}$, are probability distributions over $S$. Its 0-th level types, $\Theta_{i,0}$, contain its 0-th level beliefs and its frames, and analogously for agent $j$. 0-level types are, therefore, POMDPs. An agent's 0-level models consist of its 0-level types together with the subintentional models, $SM$. An agent's first level beliefs are probability distributions over physical states and 0-level models of the other agent. An agent's first level types consist of its first level beliefs and frames. Its first level models consist of the types up to level 1 and the subintentional models. Second level beliefs are defined in terms of first level models, and so on. Formally, define the spaces:

$IS_{i,0} = S$,  $\Theta_{j,0} = \{\langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(IS_{j,0})\}$,  $M_{j,0} = \Theta_{j,0} \cup SM_j$
$IS_{i,1} = S \times M_{j,0}$,  $\Theta_{j,1} = \{\langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1})\}$,  $M_{j,1} = \Theta_{j,1} \cup M_{j,0}$
...
$IS_{i,l} = S \times M_{j,l-1}$,  $\Theta_{j,l} = \{\langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l})\}$,  $M_{j,l} = \Theta_{j,l} \cup M_{j,l-1}$

Definition 4. (Finitely Nested I-POMDP)
A finitely nested I-POMDP of agent $i$, I-POMDP$_{i,l}$, is:

$$\text{I-POMDP}_{i,l} = \langle IS_{i,l}, A, T_i, \Omega_i, O_i, R_i \rangle \quad (9)$$
13. In 0-level types the other agent’s actions are folded into the T , O and R functions as noise.
The parameter $l$ will be called the strategy level of the finitely nested I-POMDP. The belief update, value function, and the optimal actions for finitely nested I-POMDPs are computed using Equation 6 and Equation 8, but the recursion is guaranteed to terminate at the 0-th level and subintentional models. Agents which are more strategic are capable of modeling others at deeper levels (i.e., all levels up to their own strategy level $l$), but are always only boundedly optimal. As such, these agents could fail to predict the strategy of a more sophisticated opponent. The fact that the computability of an agent function implies that the agent may be suboptimal during interactions has been pointed out by Binmore (1990), and proved more recently by Nachbar and Zame (1996). Intuitively, the difficulty is that an agent's unbounded optimality would have to include the capability to model the other agent's modeling of the original agent. This leads to an impossibility result due to self-reference, which is very similar to Gödel's incompleteness theorem and the halting problem (Brandenburger, 2002). On a positive note, some convergence results (Kalai & Lehrer, 1993) strongly suggest that approximate optimality is achievable, although their applicability to our work remains open.

As we mentioned, the 0-th level types are POMDPs. They provide probability distributions over actions of the agent modeled at that level to models with strategy level of 1. Given probability distributions over the other agent's actions, the level-1 models can themselves be solved as POMDPs, and provide probability distributions to yet higher level models. Assume that the number of models considered at each level is bounded by a number, M. Solving an I-POMDP$_{i,l}$ is then equivalent to solving $O(M^l)$ POMDPs.
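A schematic of the bottom-up procedure just described: level-0 models are solved as POMDPs, and each higher level is solved once the action predictions of the level below are available. The function solve_pomdp_given_predictions is a placeholder for a standard POMDP solver and is not specified in the paper.

```python
# A schematic of solving a finitely nested I-POMDP bottom-up. With at most M
# models kept per level, this amounts to O(M^l) POMDP solutions overall.

def solve_level(model, level, models_below, solve_pomdp_given_predictions):
    """Return a policy (mapping beliefs to actions) for `model` at `level`."""
    if level == 0:
        # Level 0: the other agent's actions are folded in as noise.
        return solve_pomdp_given_predictions(model, predictions=None)
    # Recursively solve every lower-level model of the other agent.
    predictions = {
        other: solve_level(other, level - 1, models_below,
                           solve_pomdp_given_predictions)
        for other in models_below[level - 1]
    }
    # With Pr(a_j) supplied by the lower level, this level is again a POMDP.
    return solve_pomdp_given_predictions(model, predictions=predictions)
```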
Hence, the complexity of solving an I-POMDP$_{i,l}$ is PSPACE-hard for finite time horizons, and undecidable for infinite horizons, just like for POMDPs.

6.1 Some Properties of I-POMDPs

In this section we establish two important properties, namely convergence of value iteration and piece-wise linearity and convexity of the value function, for finitely nested I-POMDPs.

6.1.1 Convergence of Value Iteration
For an agent $i$ and its I-POMDP$_{i,l}$, we can show that the sequence of value functions, $\{U_n\}$, where $n$ is the horizon, obtained by value iteration defined in Eq. 7, converges to a unique fixed point, $U^*$. Let us define a backup operator $H : B \rightarrow B$ such that $U_n = HU_{n-1}$, where $B$ is the set of all bounded value functions. In order to prove the convergence result, we first establish some of the properties of $H$.

Lemma 1 (Isotonicity).
For any finitely nested I-POMDP value functions $V$ and $U$, if $V \leq U$, then $HV \leq HU$.

The proof of this lemma is analogous to one due to Hauskrecht (1997) for POMDPs. It is also sketched in the Appendix. Another important property exhibited by the backup operator is the property of contraction.
Lemma 2 (Contraction).
For any finitely nested I-POMDP value functions $V$, $U$ and a discount factor $\gamma \in (0, 1)$, $\|HV - HU\| \leq \gamma \|V - U\|$.

The proof of this lemma is again similar to the corresponding one for POMDPs (Hauskrecht, 1997). The proof makes use of Lemma 1. $\|\cdot\|$ is the supremum norm.
14. Usually PSPACE-complete, since the number of states in I-POMDPs is likely to be larger than the time horizon (Papadimitriou & Tsitsiklis, 1987).
Under the contraction property of $H$, and noting that the space of value functions along with the supremum norm forms a complete normed space (Banach space), we can apply the Contraction Mapping Theorem (Stokey & Lucas, 1989) to show that value iteration for I-POMDPs converges to a unique fixed point (optimal solution). The following theorem captures this result.

Theorem 1 (Convergence).
For any finitely nested I-POMDP, the value iteration algorithm starting from any arbitrary well-defined value function converges to a unique fixed point.
The detailed proof of this theorem is included in the Appendix.

As in the case of POMDPs (Russell & Norvig, 2003), the error in the iterative estimates, $U_n$, for finitely nested I-POMDPs, i.e., $\|U_n - U^*\|$, is reduced by the factor of at least $\gamma$ on each iteration. Hence, the number of iterations, $N$, needed to reach an error of at most $\epsilon$ is:

$$N = \lceil \log(R_{max}/(\epsilon(1-\gamma))) / \log(1/\gamma) \rceil \quad (10)$$

where $R_{max}$ is the upper bound of the reward function.
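A quick numeric illustration of Eq. 10; the values of $R_{max}$, $\gamma$ and $\epsilon$ below are assumed for the example only.

```python
# A numeric illustration of Eq. 10 with assumed example values.
import math

def iterations_needed(r_max, gamma, epsilon):
    return math.ceil(math.log(r_max / (epsilon * (1 - gamma))) / math.log(1 / gamma))

print(iterations_needed(r_max=10, gamma=0.9, epsilon=0.01))  # 88 iterations
```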
6.1.2 Piecewise Linearity and Convexity

Another property that carries over from POMDPs to finitely nested I-POMDPs is the piecewise linearity and convexity (PWLC) of the value function. Establishing this property allows us to decompose the I-POMDP value function into a set of alpha vectors, each of which represents a policy tree. The PWLC property enables us to work with sets of alpha vectors rather than perform value iteration over the continuum of the agent's beliefs. Theorem 2 below states the PWLC property of the I-POMDP value function.
Theorem 2 (PWLC).
For any finitely nested I-POMDP, $U$ is piecewise linear and convex.

The complete proof of Theorem 2 is included in the Appendix. The proof is similar to one due to Smallwood and Sondik (1973) for POMDPs and proceeds by induction. The basis case is established by considering the horizon 1 value function. Showing PWLC for the inductive step requires substituting the belief update (Eq. 6) into Eq. 7, followed by factoring out the belief from both terms of the equation.
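The PWLC property means the value function can be represented by a finite set of alpha vectors over the interactive states and evaluated by a maximum of dot products. The sketch below illustrates this standard construction with made-up alpha vectors; it is not code from the paper.

```python
# Evaluating a PWLC value function represented by alpha vectors:
# U(b) = max_{alpha} sum_{is} b(is) * alpha(is). Alpha vectors here are toy values.

def pwlc_value(belief, alpha_vectors):
    return max(sum(belief[is_t] * alpha[is_t] for is_t in belief)
               for alpha in alpha_vectors)

# Two alpha vectors over a toy two-point interactive state space.
alphas = [{"is1": 10.0, "is2": -100.0},   # e.g., an "open door" plan
          {"is1": -1.0, "is2": -1.0}]     # e.g., a "listen" plan
print(pwlc_value({"is1": 0.5, "is2": 0.5}, alphas))   # -1.0
```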
7. Example: Multi-agent Tiger Game
To illustrate optimal sequential behavior of agents in multi-agent settings we apply our I-POMDP framework to the multi-agent tiger game, a traditional version of which we described before.
Let us denote the actions of opening doors and listening as OR, OL and L, as before. TL and TR denote states corresponding to the tiger located behind the left and right door, respectively. The transition, reward and observation functions now depend on the actions of both agents. Again, we assume that the tiger location is chosen randomly in the next time step if any of the agents opened any doors in the current step. We also assume that the agent hears the tiger's growls, GR and GL, with the accuracy of 85%. To make the interaction more interesting we added an observation of door creaks, which depend on the action executed by the other agent. Creak right, CR, is likely due
RAMEWORK FOR S EQUENTIAL P LANNING IN M ULTI -A GENT S ETTINGS to the other agent having opened the right door, and similarly for creak left, CL. Silence, S, is a goodindication that the other agent did not open doors and listened instead. We assume that the accuracyof creaks is 90%. We also assume that the agent’s payoffs are analogous to the single agent versionsdescribed in Section 3.2 to make these cases comparable. Note that the result of this assumption isthat the other agent’s actions do not impact the original agent’s payoffs directly, but rather indirectlyby resulting in states that matter to the original agent. Table 1 quantifies these factors. h a i , a j i State TL TR h OL, ∗i * 0.5 0.5 h OR, ∗i * 0.5 0.5 h∗ , OL i * 0.5 0.5 h∗ , OR i * 0.5 0.5 h L, L i T L h L, L i T R h a i , a j i TL TR h OR, OR i
10 -100 h OL, OL i -100 10 h OR, OL i
10 -100 h OL, OR i -100 10 h L, L i -1 -1 h L, OR i -1 -1 h OR, L i
10 -100 h L, OL i -1 -1 h OL, L i -100 10 h a i , a j i TL TR h OR, OR i
10 -100 h OL, OL i -100 10 h OR, OL i -100 10 h OL, OR i
10 -100 h L, L i -1 -1 h L, OR i
10 -100 h OR, L i -1 -1 h L, OL i -100 10 h OL, L i -1 -1 Transition function: T i = T j Reward functions of agents i and j h a i , a j i State h GL, CL i h
GL, CR i h
GL, S i h
GR, CL i h
GR, CR i h
GR, S ih L, L i T L h L, L i T R h L, OL i T L h L, OL i T R h L, OR i T L h L, OR i T R h OL, ∗i ∗ h OR, ∗i ∗ h a i , a j i State h GL, CL i h
GL, CR i h
GL, S i h
GR, CL i h
GR, CR i h
GR, S ih L, L i T L h L, L i T R h OL, L i T L h OL, L i T R h OR, L i T L h OR, L i T R h∗ , OL i ∗ h∗ , OR i ∗ Observation functions of agents i and j .Table 1: Transition, reward, and observation functions for the multi-agent Tiger game.When an agent makes its choice in the multi-agent tiger game, it considers what it believesabout the location of the tiger, as well as whether the other agent will listen or open a door, which inturn depends on the other agent’s beliefs, reward function, optimality criterion, etc. In particular,if the other agent were to open any of the doors the tiger location in the next time step would bechosen randomly. Thus, the information obtained from hearing the previous growls would have tobe discarded. We simplify the situation by considering i ’s I-POMDP with a single level of nesting,assuming that all of the agent j ’s properties, except for beliefs, are known to i , and that j ’s timehorizon is equal to i ’s. In other words, i ’s uncertainty pertains only to j ’s beliefs and not to itsframe. Agent i ’s interactive state space is, IS i, = S × Θ j, , where S is the physical state, S = { TL,
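A sketch assembling the Table 1 quantities in code. The transition and reward entries follow the table; the observation function is built under the assumption that the 85% growl accuracy (taken with respect to the resulting tiger location) and the 90% creak accuracy combine independently, with the residual creak probability split evenly between the other two creak observations — that split is an assumption for illustration only, since the paper states only the accuracies.

```python
# A sketch of the multi-agent tiger game quantities of Table 1.
STATES = ["TL", "TR"]
ACTIONS = ["OL", "OR", "L"]

def T(s_prev, a_i, a_j, s):
    """Joint transition: any door opening resets the tiger location."""
    if a_i != "L" or a_j != "L":
        return 0.5
    return 1.0 if s == s_prev else 0.0

def R_i(s, a_i, a_j):
    """Agent i's reward depends only on i's own action and the tiger location."""
    if a_i == "L":
        return -1
    correct = (a_i == "OR" and s == "TL") or (a_i == "OL" and s == "TR")
    return 10 if correct else -100

def growl_prob(growl, s):
    """Growls are 85% accurate with respect to the resulting tiger location."""
    correct = (growl == "GL" and s == "TL") or (growl == "GR" and s == "TR")
    return 0.85 if correct else 0.15

def creak_prob(creak, a_j):
    """Creaks reveal the other agent's action with 90% accuracy; the remaining
    probability is split evenly (an assumption for illustration)."""
    truth = {"OL": "CL", "OR": "CR", "L": "S"}[a_j]
    return 0.9 if creak == truth else 0.05

def O_i(s, a_i, a_j, growl, creak):
    return growl_prob(growl, s) * creak_prob(creak, a_j)

print(O_i("TL", "L", "L", "GL", "S"))   # 0.85 * 0.9 = 0.765
```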
15. We assume an intentional model of the other agent here.
Agent $i$'s interactive state space is $IS_{i,1} = S \times \Theta_{j,0}$, where $S$ is the physical state space, $S = \{TL, TR\}$, and $\Theta_{j,0}$ is a set of intentional models of agent $j$, each of which differs only in $j$'s belief over the location of the tiger.

In Section 5, we presented the belief update equation for I-POMDPs (Eq. 6). Here we consider examples of beliefs, $b_{i,1}$, of agent $i$, which are probability distributions over $S \times \Theta_{j,0}$. Each 0-th level type of agent $j$, $\theta_{j,0} \in \Theta_{j,0}$, contains a "flat" belief as to the location of the tiger, which can be represented by a single probability assignment – $b_{j,0} = p_j(TL)$.

Figure 5: Two examples of singly nested belief states of agent $i$. In each case $i$ has no information about the tiger's location. In (i) agent $i$ knows that $j$ does not know the location of the tiger; the single point (star) denotes a Dirac delta function which integrates to the height of the point, here 0.5. In (ii) agent $i$ is uninformed about $j$'s beliefs about the tiger's location.

In Fig. 5 we show some examples of level 1 beliefs of agent $i$. In each case $i$ does not know the location of the tiger, so that the marginals in the top and bottom sections of the figure sum up to 0.5 for the probabilities of TL and TR each. In Fig. 5 (i), $i$ knows that $j$ assigns 0.5 probability to the tiger being behind the left door. This is represented using a Dirac delta function. In Fig. 5 (ii), agent $i$ is uninformed about $j$'s beliefs. This is represented as a uniform probability density over all values of the probability $j$ could assign to state TL.

To make the presentation of the belief update more transparent we decompose the formula in Eq. 6 into two steps:
• Prediction:
When agent $i$ performs an action $a_i^{t-1}$, and given that agent $j$ performs $a_j^{t-1}$, the predicted belief state is:

$$\hat{b}_i^t(is^t) = Pr(is^t \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1}) = \sum_{is^{t-1}: \hat{\theta}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1}) \, Pr(a_j^{t-1} \mid \theta_j^{t-1}) \times T(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t) \times \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \quad (11)$$

• Correction:
When agent $i$ perceives an observation, $o_i^t$, the predicted belief states, $Pr(\cdot \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1})$, are combined according to:

$$b_i^t(is^t) = Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1}) = \beta \sum_{a_j^{t-1}} O_i(s^t, a^{t-1}, o_i^t) \, Pr(is^t \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1}) \quad (12)$$

where $\beta$ is the normalizing constant.
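A small arithmetic check of the two steps for the belief-update trace discussed next, in which both agents listen, $j$ starts with the flat belief 0.5, and $i$ then observes ⟨GL,S⟩. Only the stated 85% growl accuracy is used; the resulting numbers reproduce those quoted in the text.

```python
# Prediction and correction for one step of the multi-agent tiger example.
p_growl = 0.85                        # accuracy of hearing the tiger's growl

# Prediction: under (L, L) the tiger stays put, so with prior 0.5 on TL the
# predicted joint probability of (TL, "j heard GL and now assigns 0.85 to TL")
# is 0.5 * Pr(GL | TL) = 0.425, and similarly for the other combinations.
pred = {
    ("TL", 0.85): 0.5 * p_growl,          # 0.425
    ("TL", 0.15): 0.5 * (1 - p_growl),    # 0.075
    ("TR", 0.85): 0.5 * (1 - p_growl),    # 0.075
    ("TR", 0.15): 0.5 * p_growl,          # 0.425
}

# Correction: weight by Pr(<GL,S> | s); the creak factor is the same for both
# states and cancels in the normalization, leaving the growl term.
posterior = {k: v * (p_growl if k[0] == "TL" else 1 - p_growl)
             for k, v in pred.items()}
norm = sum(posterior.values())
posterior = {k: v / norm for k, v in posterior.items()}

marginal_TL = sum(v for (s, _), v in posterior.items() if s == "TL")
print(round(marginal_TL, 2))   # 0.85, matching the value quoted in the text
```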
Figure 6: A trace of the belief update of agent $i$: the prior $b_i^{t-1}$, the prediction after $i$'s listening action, and the posteriors $b_i^t$ and $b_i^{t+1}$ after $i$ listens and observes ⟨GL,S⟩ in two consecutive time steps.

Fig. 6 traces the belief update starting from the prior in Fig. 5 (i), according to which $i$ knows that $j$ is uninformed of the location of the tiger. Let us assume that $i$ listens and hears a growl from the left and no creaks. The second column of Fig. 6, (b), displays the predicted belief after $i$ performs the listen action (Eq. 11). As part of the prediction step, agent $i$ must solve $j$'s model to obtain $j$'s optimal action when its belief is 0.5 (the term $Pr(a_j^{t-1} \mid \theta_j)$ in Eq. 11). Given the value function in Fig. 3, this evaluates to probability of 1 for the listen action, and zero for opening either of the doors. $i$ also updates $j$'s belief given that $j$ listens and hears the tiger growling from either the left, GL, or right, GR (the term $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ in Eq. 11). Agent $j$'s updated probabilities for the tiger being on the left are 0.85 and 0.15, for $j$'s hearing GL and GR, respectively. If the tiger is on the left (top of Fig. 6 (b)), $j$'s observation GL is more likely, and consequently $j$'s assigning the probability of 0.85 to state TL is more likely ($i$ assigns a probability of 0.425 to this state). When the tiger is on the right, $j$ is more likely to hear GR and $i$ assigns the lower probability, 0.075, to $j$'s assigning a probability 0.85 to the tiger being on the left. The third column, (c), of Fig. 6 shows the posterior belief after the correction step. The belief in column (b) is updated to account for $i$'s hearing a growl from the left and no creaks, ⟨GL,S⟩. The resulting marginalised probability of the tiger being on the left is higher (0.85) than that of the tiger being on the right. If we assume that in the next time step $i$ again listens and hears the tiger growling from the left and no creaks, the belief state depicted in the fourth column of Fig. 6 results.

In Fig. 7 we show the belief update starting from the prior in Fig. 5 (ii), according to which agent $i$ initially has no information about what $j$ believes about the tiger's location.

The traces of belief updates in Fig. 6 and Fig. 7 illustrate the changing state of information agent $i$ has about the other agent's beliefs. The benefit of representing these updates explicitly is that, at each stage, $i$'s optimal behavior depends on its estimate of the probabilities of $j$'s actions. The more informative these estimates are, the more value agent $i$ can expect out of the interaction. Below, we show the increase in the value function for I-POMDPs compared to POMDPs with the noise factor.

This section compares value functions obtained from solving a POMDP with a static noise factor, accounting for the presence of another agent, to value functions of a level-1 I-POMDP. The advantage of more refined modeling and update in I-POMDPs is due to two factors. First is the ability to keep track of the other agent's state of beliefs to better predict its future actions. The second is the ability to adjust the other agent's time horizon as the number of steps to go during the interaction decreases. Neither of these is possible within the classical POMDP formalism.

We continue with the simple example of I-POMDP$_{i,1}$ of agent $i$. In Fig. 8 we display $i$'s value function for the time horizon of 1, assuming that $i$'s initial belief as to the value $j$ assigns to TL, $p_j(TL)$, is as depicted in Fig. 5 (ii), i.e. $i$ has no information about what $j$ believes about the tiger's location. This value function is identical to the value function obtained for an agent using a traditional POMDP framework with noise, as well as the single agent POMDP which we described in Section 3.2.
The value functions overlap since agents do not have to update their beliefs and the advantage of more refined modeling of agent $j$ in $i$'s I-POMDP does not become apparent. Put another way, when agent $i$ models $j$ using an intentional model, it concludes that agent $j$ will open each door with probability 0.1 and listen with probability 0.8. This coincides with the noise factor we described in Section 3.2.
16. The points in Fig. 7 again denote Dirac delta functions which integrate to the value equal to the points' height.
17. The POMDP with noise is the same as the level-0 I-POMDP.
Figure 7: A trace of the belief update of agent $i$. (a) depicts the prior according to which $i$ is uninformed about $j$'s beliefs. (b) is the result of the prediction step after $i$'s listening action (L). The top half of (b) shows $i$'s belief after it has listened and given that $j$ also listened. The two observations $j$ can make, GL and GR, each with probability dependent on the tiger's location, give rise to flat portions representing what $i$ knows about $j$'s belief in each case. The increased probability $i$ assigns to $j$'s belief between 0.472 and 0.528 is due to $j$'s updates after it hears GL and after it hears GR resulting in the same values in this interval. The bottom half of (b) shows $i$'s belief after $i$ has listened and $j$ has opened the left or right door (plots are identical for each action and only one of them is shown). $i$ knows that $j$ has no information about the tiger's location in this case. (c) is the result of correction after $i$ observes the tiger's growl on the left and no creaks, ⟨GL,S⟩. The plots in (c) are obtained by performing a weighted summation of the plots in (b). The probability $i$ assigns to TL is now greater than TR, and information about $j$'s beliefs allows $i$ to refine its prediction of $j$'s action in the next time step.
Figure 8: For time horizon of 1 the value functions obtained from solving a singly nested I-POMDP and a POMDP with a noise factor overlap.

Figure 9: Comparison of value functions obtained from solving a level-1 I-POMDP and a POMDP with noise for time horizon of 2. Policies corresponding to the value lines are conditional plans.
Figure 10: Comparison of value functions obtained from solving an I-POMDP and a POMDP with noise for time horizon of 3. The I-POMDP value function dominates due to agent $i$'s adjusting $j$'s remaining steps to go, and due to $i$'s modeling of $j$'s belief update. Both factors allow for better predictions of $j$'s actions during the interaction. The descriptions of individual policies were omitted for clarity; they can be read off of Fig. 11.

In Fig. 9 we display $i$'s value functions for the time horizon of 2. The value function of I-POMDP$_{i,1}$ is higher than the value function of a POMDP with a noise factor. The reason is not related to the advantages of modeling agent $j$'s beliefs – this effect becomes apparent at the time horizon of 3 and longer. Rather, the I-POMDP solution dominates due to agent $i$ modeling $j$'s time horizon during the interaction: $i$ knows that at the last time step $j$ will behave according to its optimal policy for time horizon of 1, while with two steps to go $j$ will optimize according to its 2-steps-to-go policy. As we mentioned, this effect cannot be modeled using a POMDP with a static noise factor included in the transition function.

Fig. 10 shows a comparison between the I-POMDP and the noisy POMDP value functions for horizon 3. The advantage of more refined agent modeling within the I-POMDP framework has increased. Both factors, $i$'s adjusting $j$'s steps to go and $i$'s modeling of $j$'s belief update during the interaction, are responsible for the superiority of values achieved using the I-POMDP. In particular, recall that at the second time step $i$'s information as to $j$'s beliefs about the tiger's location is as depicted in Fig. 7 (c). This enables $i$ to make a high quality prediction that, with two steps left to go, $j$ will perform its actions OL, L, and OR with probabilities 0.009076, 0.96591 and 0.02501, respectively (recall that for the POMDP with noise these probabilities remained unchanged at 0.1, 0.8, and 0.1, respectively).

Fig. 11 shows agent $i$'s policy graph for time horizon of 3. As usual, it prescribes the optimal first action depending on the initial belief as to the tiger's location. The subsequent actions depend on the observations received. The observations include creaks that are indicative of the other agent's
18. Note that the I-POMDP solution is not as good as the solution of a POMDP for an agent operating alone in the environment, shown in Fig. 3.
[Figure 11 (policy graph): nodes labeled with the actions L, OL, and OR; arcs labeled with observations; the initial belief p_i(TL) is partitioned into the intervals [0, 0.029), [0.029, 0.089), [0.089, 0.211), [0.211, 0.789), [0.789, 0.911), [0.911, 0.971), and [0.971, 1].]
Figure 11: The policy graph corresponding to the I-POMDP value function in Fig. 10.

The creaks contain valuable information and allow the agent to make more refined choices, compared to the ones available in the noisy POMDP in Fig. 4. Consider the case when agent i starts out with a fairly strong belief as to the tiger's location, decides to listen (according to the four off-center top-row "L" nodes in Fig. 11) and hears a door creak. The agent is then in a position to open either the left or the right door, even if that is counter to its initial belief. The reason is that the creak is an indication that the tiger's position has likely been reset by agent j and that j will then not open any of the doors during the following two time steps. Now, two growls coming from the same door lead to enough confidence to open the other door. This is because agent i's hearing of the tiger's growls is indicative of the tiger's position in the state following the agents' actions.

Note that the value functions and the policy above depict a special case of agent i having no information as to what probability j assigns to the tiger's location (Fig. 5 (ii)). Accounting for and visualizing all possible beliefs i can have about j's beliefs is difficult due to the complexity of the space of interactive beliefs. As our ongoing work indicates, a drastic reduction in complexity is possible without loss of information, and consequently representation of solutions in a manageable number of dimensions is indeed possible. We will report these results separately.
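Purely as an illustration of how a finite-horizon policy graph like the one in Fig. 11 can be stored and executed, here is a minimal sketch; the data structures and names are hypothetical simplifications, not the implementation used to produce the figure. Which action label each start node carries, and which node each observation leads to, is exactly the information the figure records.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class PolicyNode:
    """One node of a finite-horizon policy graph: an action to execute and,
    for each possible observation, the node to move to next."""
    action: str                                        # e.g., "L", "OL", "OR"
    next_node: Dict[str, "PolicyNode"] = field(default_factory=dict)

@dataclass
class PolicyGraph:
    """The initial belief p_i(TL) selects the start node via a partition of
    [0, 1]; afterwards only the received observations are needed."""
    intervals: List[Tuple[float, float, PolicyNode]]   # (low, high, start node)

    def start(self, p_tl: float) -> PolicyNode:
        for low, high, node in self.intervals:
            if low <= p_tl < high or (p_tl == 1.0 and high == 1.0):
                return node
        raise ValueError("belief outside [0, 1]")

    def execute(self, p_tl: float, observations: List[str]) -> List[str]:
        """Return the action sequence prescribed for an observation history."""
        node = self.start(p_tl)
        actions = [node.action]
        for obs in observations:                       # e.g., "<GL, S>", "<GR, CL>"
            node = node.next_node[obs]
            actions.append(node.action)
        return actions
```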
8. Conclusions
We proposed a framework for optimal sequential decision-making suitable for controlling autonomous agents interacting with other agents within an uncertain environment. We used the normative paradigm of decision-theoretic planning under uncertainty formalized as partially observable Markov decision processes (POMDPs) as a point of departure. We extended POMDPs to cases of agents interacting with other agents by allowing them to have beliefs not only about the physical environment, but also about the other agents. This could include beliefs about the others' abilities, sensing capabilities, beliefs, preferences, and intended actions. Our framework shares numerous properties with POMDPs, has analogously defined solutions, and reduces to POMDPs when agents are alone in the environment.

In contrast to some recent work on DEC-POMDPs (Bernstein et al., 2002; Nair et al., 2003), and to work motivated by game-theoretic equilibria (Boutilier, 1999; Hu & Wellman, 1998; Koller
& Milch, 2001; Littman, 1994), our approach is subjective and amenable to agents independently computing their optimal solutions.

The line of work presented here opens an area of future research on integrating frameworks for sequential planning with elements of game theory and Bayesian learning in interactive settings. In particular, one of the avenues of our future research centers on proving further formal properties of I-POMDPs, and establishing clearer relations between solutions to I-POMDPs and various flavors of equilibria. Another concentrates on developing efficient approximation techniques for solving I-POMDPs. As for POMDPs, development of approximate approaches to I-POMDPs is crucial for moving beyond toy problems. One promising approximation technique we are working on is particle filtering. We are also devising methods for representing I-POMDP solutions without assumptions about what is believed about other agents' beliefs. As we mentioned, in spite of the complexity of the interactive state space, there seem to be intuitive representations of belief partitions corresponding to optimal policies, analogous to those for POMDPs. Other research issues include the suitable choice of priors over models, and the ways to fulfill the absolute continuity condition needed for convergence of probabilities assigned to the alternative models during interactions (Kalai & Lehrer, 1993).

Acknowledgments
This research is supported by the National Science Foundation CAREER award IRI-9702132, and NSF award IRI-0119270.
Appendix A. Proofs
Proof of Propositions 1 and 2.
We start with Proposition 2, by applying Bayes' theorem:

$$
\begin{aligned}
b_i^t(is^t) &= \Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1}) = \frac{\Pr(is^t, o_i^t \mid a_i^{t-1}, b_i^{t-1})}{\Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1})} \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \Pr(is^t, o_i^t \mid a_i^{t-1}, is^{t-1}) \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(is^t, o_i^t \mid a_i^{t-1}, a_j^{t-1}, is^{t-1}) \Pr(a_j^{t-1} \mid a_i^{t-1}, is^{t-1}) \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(is^t, o_i^t \mid a_i^{t-1}, a_j^{t-1}, is^{t-1}) \Pr(a_j^{t-1} \mid is^{t-1}) \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \Pr(o_i^t \mid is^t, a^{t-1}, is^{t-1}) \Pr(is^t \mid a^{t-1}, is^{t-1}) \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \Pr(o_i^t \mid is^t, a^{t-1}) \Pr(is^t \mid a^{t-1}, is^{t-1}) \\
&= \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \Pr(is^t \mid a^{t-1}, is^{t-1}) \qquad (13)
\end{aligned}
$$
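As an aside, Eq. 13 has the shape of a standard Bayes filter run over interactive states. The following minimal sketch illustrates that reading; it assumes a small, finite discretization of the interactive state space, and the helper names (pr_aj, O_i, trans_is) are hypothetical stand-ins for Pr(a_j^{t-1} | m_j^{t-1}), O_i, and Pr(is^t | a^{t-1}, is^{t-1}).

```python
from typing import Callable, Dict, Hashable, List, Tuple

ISState = Hashable  # an interactive state (physical state, model of j), discretized

def ipomdp_belief_update(
    b_prev: Dict[ISState, float],                 # b_i^{t-1}
    a_i: str,                                     # i's previous action
    o_i: str,                                     # i's current observation
    actions_j: List[str],
    pr_aj: Callable[[str, ISState], float],       # Pr(a_j^{t-1} | m_j^{t-1} in is^{t-1})
    O_i: Callable[[ISState, Tuple[str, str], str], float],  # O_i; depends only on s^t in is^t
    trans_is: Callable[[ISState, Tuple[str, str], ISState], float],  # Pr(is^t | a^{t-1}, is^{t-1})
    states: List[ISState],
) -> Dict[ISState, float]:
    """One step of the filter in Eq. 13, normalized at the end (the constant beta)."""
    b_new: Dict[ISState, float] = {}
    for is_t in states:
        total = 0.0
        for is_prev, p_prev in b_prev.items():
            for a_j in actions_j:
                a = (a_i, a_j)                    # joint action a^{t-1}
                total += p_prev * pr_aj(a_j, is_prev) * O_i(is_t, a, o_i) * trans_is(is_prev, a, is_t)
        b_new[is_t] = total
    norm = sum(b_new.values())
    return {is_t: (p / norm if norm > 0 else 0.0) for is_t, p in b_new.items()}
```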
19. We are looking at Kolmogorov complexity (Li & Vitanyi, 1997) as a possible way to assign priors.
To simplify the term Pr(is^t | a^{t-1}, is^{t-1}), let us substitute the interactive state is^t with its components. When m_j in the interactive states is intentional, is^t = (s^t, θ_j^t) = (s^t, b_j^t, \hat{θ}_j^t):

$$
\begin{aligned}
\Pr(is^t \mid a^{t-1}, is^{t-1}) &= \Pr(s^t, b_j^t, \hat{\theta}_j^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) \Pr(s^t, \hat{\theta}_j^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) \Pr(\hat{\theta}_j^t \mid s^t, a^{t-1}, is^{t-1}) \Pr(s^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1})\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (14)
\end{aligned}
$$

When m_j is subintentional, is^t = (s^t, m_j^t) = (s^t, h_j^t, \hat{m}_j^t):

$$
\begin{aligned}
\Pr(is^t \mid a^{t-1}, is^{t-1}) &= \Pr(s^t, h_j^t, \hat{m}_j^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) \Pr(s^t, \hat{m}_j^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) \Pr(\hat{m}_j^t \mid s^t, a^{t-1}, is^{t-1}) \Pr(s^t \mid a^{t-1}, is^{t-1}) \\
&= \Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1})\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (14')
\end{aligned}
$$

The joint action pair, a^{t-1}, may change the physical state. The third term on the right-hand side of Eqs. 14 and 14' above captures this transition. We utilized the MNM assumption to replace the second terms of the equations with boolean identity functions, I(\hat{θ}_j^{t-1}, \hat{θ}_j^t) and I(\hat{m}_j^{t-1}, \hat{m}_j^t) respectively, which equal 1 if the two frames are identical, and 0 otherwise. Let us turn our attention to the first terms. If m_j in is^t and is^{t-1} is intentional:

$$
\begin{aligned}
\Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) &= \sum_{o_j^t} \Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}, o_j^t) \Pr(o_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) \\
&= \sum_{o_j^t} \Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}, o_j^t) \Pr(o_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}) \\
&= \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t) \qquad (15)
\end{aligned}
$$

Else if it is subintentional:

$$
\begin{aligned}
\Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) &= \sum_{o_j^t} \Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}, o_j^t) \Pr(o_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) \\
&= \sum_{o_j^t} \Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}, o_j^t) \Pr(o_j^t \mid s^t, \hat{m}_j^t, a^{t-1}) \\
&= \sum_{o_j^t} \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t) \qquad (15')
\end{aligned}
$$

In Eq. 15, the first term on the right-hand side is 1 if agent j's belief update, SE_{θ_j}(b_j^{t-1}, a_j^{t-1}, o_j^t), generates a belief state equal to b_j^t. Similarly, in Eq. 15', the first term is 1 if appending o_j^t to h_j^{t-1} results in h_j^t; δ_K is the Kronecker delta function. In the second terms on the right-hand side of the equations, the MNO assumption makes it possible to replace Pr(o_j^t | s^t, \hat{θ}_j^t, a^{t-1}) and Pr(o_j^t | s^t, \hat{m}_j^t, a^{t-1}) with O_j(s^t, a^{t-1}, o_j^t), respectively. Let us now substitute Eq. 15 into Eq. 14:

$$
\Pr(is^t \mid a^{t-1}, is^{t-1}) = \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (16)
$$

Substituting Eq. 15' into Eq. 14' we get

$$
\Pr(is^t \mid a^{t-1}, is^{t-1}) = \sum_{o_j^t} \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (16')
$$
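As an aside, Eq. 16 can be read operationally: the new interactive state is reachable only if j's frame is unchanged, the physical state follows T_i, and j's new belief is one that some observation o_j^t could have produced. A minimal sketch for the intentional case, with a discretized belief for j and hypothetical helper names (tau, O_j, T_i) standing in for the corresponding model functions:

```python
from typing import Callable, List, Tuple

# An intentional interactive state, discretized: (physical state s, (b_j, frame_j))
IState = Tuple[str, Tuple[float, str]]

def pr_is_transition(
    is_prev: IState, a: Tuple[str, str], is_t: IState,
    observations_j: List[str],
    tau: Callable[[float, str, str, float], float],    # tau_{theta_j}(b_j^{t-1}, a_j, o_j, b_j^t), 0 or 1
    O_j: Callable[[str, Tuple[str, str], str], float],  # O_j(s^t, a^{t-1}, o_j^t)
    T_i: Callable[[str, Tuple[str, str], str], float],  # T_i(s^{t-1}, a^{t-1}, s^t)
) -> float:
    """Eq. 16: Pr(is^t | a^{t-1}, is^{t-1}) when j's model is intentional."""
    s_prev, (bj_prev, frame_prev) = is_prev
    s_t, (bj_t, frame_t) = is_t
    if frame_prev != frame_t:                 # the identity term I(frame^{t-1}, frame^t)
        return 0.0
    a_j = a[1]
    return T_i(s_prev, a, s_t) * sum(
        tau(bj_prev, a_j, o_j, bj_t) * O_j(s_t, a, o_j) for o_j in observations_j
    )
```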
Replacing Eq. 16 into Eq. 13 we get:

$$
b_i^t(is^t) = \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (17)
$$

Similarly, replacing Eq. 16' into Eq. 13 we get:

$$
b_i^t(is^t) = \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (17')
$$

We arrive at the final expressions for the belief update by removing the terms I(\hat{θ}_j^{t-1}, \hat{θ}_j^t) and I(\hat{m}_j^{t-1}, \hat{m}_j^t) and changing the scope of the first summations. When m_j in the interactive states is intentional:

$$
b_i^t(is^t) = \beta \sum_{is^{t-1}: \hat{\theta}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (18)
$$

Else, if it is subintentional:

$$
b_i^t(is^t) = \beta \sum_{is^{t-1}: \hat{m}_j^{t-1} = \hat{m}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \qquad (19)
$$

Since Proposition 2 expresses the belief b_i^t(is^t) in terms of parameters of the previous time step only, Proposition 1 holds as well.

Before we present the proof of Theorem 1 we note that Equation 7, which defines value iteration in I-POMDPs, can be rewritten in the form U^n = H U^{n-1}. Here, H : B → B is a backup operator, defined as

$$
HU^{n-1}(\theta_i) = \max_{a_i \in A_i} h(\theta_i, a_i, U^{n-1})
$$

where h : Θ_i × A_i × B → ℝ is

$$
h(\theta_i, a_i, U) = \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle)
$$

and where B is the set of all bounded value functions U. Lemmas 1 and 2 establish important properties of the backup operator. Proof of Lemma 1 is given below, and proof of Lemma 2 follows thereafter.

Proof of Lemma 1.
Select arbitrary value functions V and U such that V(θ_{i,l}) ≤ U(θ_{i,l}) for all θ_{i,l} ∈ Θ_{i,l}. Let θ_{i,l} be an arbitrary type of agent i.
Let a_i^* be the action that attains the maximum in HV(θ_{i,l}). Then,

$$
\begin{aligned}
HV(\theta_{i,l}) &= \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&= \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\le \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\le \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&= HU(\theta_{i,l})
\end{aligned}
$$

Since θ_{i,l} is arbitrary, HV ≤ HU.

Proof of Lemma 2.
Assume two arbitrary well-defined value functions V and U such that V ≤ U. From Lemma 1 it follows that HV ≤ HU. Let θ_{i,l} be an arbitrary type of agent i. Also, let a_i^* be the action that optimizes HU(θ_{i,l}).

$$
\begin{aligned}
0 &\le HU(\theta_{i,l}) - HV(\theta_{i,l}) \\
&= \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&\qquad - \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&\le \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\qquad - \sum_{is} b_i(is)\, ER_i(is, a_i^*) - \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&= \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i) \Big[ U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) - V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \Big] \\
&\le \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, \|U - V\| = \gamma \|U - V\|
\end{aligned}
$$

As the supremum norm is symmetric, a similar bound holds for HV(θ_{i,l}) − HU(θ_{i,l}). Since θ_{i,l} is arbitrary, the contraction property follows, i.e. ||HV − HU|| ≤ γ||V − U||.

Lemmas 1 and 2 provide the stepping stones for proving Theorem 1. Proof of Theorem 1 follows from a straightforward application of the Contraction Mapping Theorem. We state the Contraction Mapping Theorem (Stokey & Lucas, 1989) below:

Theorem 3 (Contraction Mapping Theorem). If (S, ρ) is a complete metric space and T : S → S is a contraction mapping with modulus γ, then
1. T has exactly one fixed point U* in S, and
2. the sequence {U^n} converges to U*.

Proof of Theorem 1 follows.
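Before turning to the proof, and purely as an illustration of what the contraction of the backup operator buys in practice, here is a minimal sketch of the iteration U^n = H U^{n-1} over a finite set of discretized types; the containers and the function h passed in are hypothetical simplifications (h corresponds to h(θ_i, a_i, U) defined above), and γ is assumed to lie in [0, 1).

```python
from typing import Callable, Dict, Hashable, List

Type_i = Hashable   # a discretized type (belief, frame) of agent i

def value_iteration(
    types: List[Type_i],
    actions: List[str],
    h: Callable[[Type_i, str, Dict[Type_i, float]], float],  # h(theta_i, a_i, U)
    gamma: float,                                            # discount, assumed 0 <= gamma < 1
    tol: float = 1e-6,
) -> Dict[Type_i, float]:
    """Iterate U^n = H U^{n-1}; by the contraction property the iterates converge
    to the unique fixed point, and the sup-norm change bounds the remaining error."""
    U: Dict[Type_i, float] = {t: 0.0 for t in types}
    while True:
        U_next = {t: max(h(t, a, U) for a in actions) for t in types}   # the backup H
        diff = max(abs(U_next[t] - U[t]) for t in types)                # ||HU - U||
        U = U_next
        if diff * gamma / (1.0 - gamma) < tol:   # standard stopping criterion for contractions
            return U
```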
Proof of Theorem 1.
The normed space (B, ||·||) is complete with respect to the metric induced by the supremum norm. Lemma 2 establishes the contraction property of the backup operator, H. Using Theorem 3, and substituting T with H, convergence of value iteration in I-POMDPs to a unique fixed point is established.

We go on to the piecewise linearity and convexity (PWLC) property of the value function. We follow the outline of the analogous proof for POMDPs in (Hauskrecht, 1997; Smallwood & Sondik, 1973). Let α : IS → ℝ be a real-valued and bounded function, and let the space of such real-valued bounded functions be B(IS). We will now define an inner product.

Definition 5 (Inner product).
Define the inner product ⟨·, ·⟩ : B(IS) × Δ(IS) → ℝ by

$$
\langle \alpha, b_i \rangle = \sum_{is} b_i(is)\, \alpha(is)
$$

The next lemma establishes the bilinearity of the inner product defined above.
Lemma 3 (Bilinearity).
For any s, t ∈ ℝ, f, g ∈ B(IS), and b, λ ∈ Δ(IS), the following equalities hold:

$$
\langle sf + tg, b \rangle = s\langle f, b \rangle + t\langle g, b \rangle, \qquad \langle f, sb + t\lambda \rangle = s\langle f, b \rangle + t\langle f, \lambda \rangle
$$

We are now ready to give the proof of Theorem 2. Theorem 4 restates Theorem 2 mathematically, and its proof follows thereafter.
Theorem 4 (PWLC).
The value function, U^n, in a finitely nested I-POMDP is piece-wise linear and convex (PWLC). Mathematically,

$$
U^n(\theta_{i,l}) = \max_{\alpha^n} \sum_{is} b_i(is)\, \alpha^n(is), \qquad n = 1, 2, \ldots
$$

Proof of Theorem 4.
Basis Step: n = 1. From Bellman's dynamic programming equation,

$$
U^1(\theta_i) = \max_{a_i} \sum_{is} b_i(is)\, ER_i(is, a_i) \qquad (20)
$$

where ER_i(is, a_i) = Σ_{a_j} R(is, a_i, a_j) Pr(a_j | m_j). Here, ER_i(·) represents the expectation of R with respect to agent j's actions. Eq. 20 represents an inner product and, using Lemma 3, the inner product is linear in b_i. By selecting the maximum of a set of linear vectors (hyperplanes), we obtain a PWLC horizon-1 value function.

Inductive Hypothesis:
Suppose that U^{n-1}(θ_{i,l}) is PWLC. Formally, we have

$$
\begin{aligned}
U^{n-1}(\theta_{i,l}) &= \max_{\alpha^{n-1}} \sum_{is} b_i(is)\, \alpha^{n-1}(is) \\
&= \max_{\dot{\alpha}^{n-1}, \ddot{\alpha}^{n-1}} \Big\{ \sum_{is: m_j \in IM_j} b_i(is)\, \dot{\alpha}^{n-1}(is) + \sum_{is: m_j \in SM_j} b_i(is)\, \ddot{\alpha}^{n-1}(is) \Big\} \qquad (21)
\end{aligned}
$$
Inductive Proof:
To show that U^n(θ_{i,l}) is PWLC:

$$
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1})\, U^{n-1}(\theta_{i,l}) \Big\}
$$

From the inductive hypothesis:

$$
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \max_{\alpha^{n-1} \in \Gamma^{n-1}} \sum_{is^t} b_i^t(is^t)\, \alpha^{n-1}(is^t) \Big\}
$$

Let l(b_i^{t-1}, a_i^{t-1}, o_i^t) be the index of the alpha vector that maximizes the value at b_i^t = SE(b_i^{t-1}, a_i^{t-1}, o_i^t). Then,

$$
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \sum_{is^t} b_i^t(is^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\}
$$

From the second equation in the inductive hypothesis:

$$
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \Big[ \sum_{is^t: m_j^t \in IM_j} b_i^t(is^t)\, \dot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) + \sum_{is^t: m_j^t \in SM_j} b_i^t(is^t)\, \ddot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big] \Big\}
$$

Substituting b_i^t with the appropriate belief updates from Eqs. 17 and 17' we get:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \bigg\{ & \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1})\, \beta \\
& \times \bigg[ \sum_{is^t: m_j^t \in IM_j} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1}) \Big[ O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \dot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \\
& \quad + \sum_{is^t: m_j^t \in SM_j} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \Big[ O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \ddot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \bigg] \bigg\}
\end{aligned}
$$
and further, since β is the inverse of Pr(o_i^t | a_i^{t-1}, b_i^{t-1}), the two factors cancel:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \bigg\{ & \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \bigg[ \sum_{is^t: m_j^t \in IM_j} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1}) \\
& \Big[ O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \dot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \\
& + \sum_{is^t: m_j^t \in SM_j} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \Big[ O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t) \\
& \quad \times \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \ddot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \bigg] \bigg\}
\end{aligned}
$$

Rearranging the terms of the equation:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \bigg\{ & \sum_{is^{t-1}: m_j^{t-1} \in IM_j} b_i^{t-1}(is^{t-1}) \Big\{ ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t: m_j^t \in IM_j} \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1}) \\
& \Big[ O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \dot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\} \\
& + \sum_{is^{t-1}: m_j^{t-1} \in SM_j} b_i^{t-1}(is^{t-1}) \Big\{ ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t: m_j^t \in SM_j} \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \\
& \Big[ O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \ddot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\} \bigg\} \\
= \max_{a_i^{t-1}} \bigg\{ & \sum_{is^{t-1}: m_j^{t-1} \in IM_j} b_i^{t-1}(is^{t-1})\, \dot{\alpha}^{n}_{a_i}(is^{t-1}) + \sum_{is^{t-1}: m_j^{t-1} \in SM_j} b_i^{t-1}(is^{t-1})\, \ddot{\alpha}^{n}_{a_i}(is^{t-1}) \bigg\}
\end{aligned}
$$

Therefore,

$$
U^n(\theta_{i,l}) = \max_{\dot{\alpha}^n, \ddot{\alpha}^n} \Big\{ \sum_{is^{t-1}: m_j^{t-1} \in IM_j} b_i^{t-1}(is^{t-1})\, \dot{\alpha}^n(is^{t-1}) + \sum_{is^{t-1}: m_j^{t-1} \in SM_j} b_i^{t-1}(is^{t-1})\, \ddot{\alpha}^n(is^{t-1}) \Big\} = \max_{\alpha^n} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, \alpha^n(is^{t-1}) = \max_{\alpha^n} \langle b_i^{t-1}, \alpha^n \rangle \qquad (22)
$$
where, if m_j^{t-1} in is^{t-1} is intentional, then α^n = \dot{α}^n:

$$
\begin{aligned}
\dot{\alpha}^n(is^{t-1}) = {} & ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t: m_j^t \in IM_j} \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1}) \Big[ O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \dot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t)
\end{aligned}
$$

and, if m_j^{t-1} is subintentional, then α^n = \ddot{α}^n:

$$
\begin{aligned}
\ddot{\alpha}^n(is^{t-1}) = {} & ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t: m_j^t \in SM_j} \Big\{ \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1}) \Big[ O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\text{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big] \Big\} \ddot{\alpha}^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t)
\end{aligned}
$$

Eq. 22 is an inner product and, using Lemma 3, the value function is linear in b_i^{t-1}. Furthermore, maximizing over a set of linear vectors (hyperplanes) produces a piecewise linear and convex value function.

References
Ambruster, W., & Boge, W. (1979). Bayesian game theory. In Moeschlin, O., & Pallaschke, D. (Eds.), Game Theory and Related Topics. North Holland.

Aumann, R. J. (1999). Interactive epistemology I: Knowledge. International Journal of Game Theory, pp. 263–300.

Aumann, R. J., & Heifetz, A. (2002). Incomplete information. In Aumann, R., & Hart, S. (Eds.), Handbook of Game Theory with Economic Applications, Volume III, Chapter 43. Elsevier.

Battigalli, P., & Siniscalchi, M. (1999). Hierarchies of conditional beliefs and interactive epistemology in dynamic games. Journal of Economic Theory, pp. 188–230.

Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, (4), 819–840.

Binmore, K. (1990). Essays on Foundations of Game Theory. Blackwell.

Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 478–485.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1–94.

Brandenburger, A. (2002). The power of paradox: Some recent developments in interactive epistemology. Tech. rep., Stern School of Business, New York University, http://pages.stern.nyu.edu/ abranden/.

Brandenburger, A., & Dekel, E. (1993). Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 189–198.

Dennett, D. (1986). Intentional systems. In Dennett, D. (Ed.), Brainstorms. MIT Press.

Fagin, R. R., Geanakoplos, J., Halpern, J. Y., & Vardi, M. Y. (1999). A hierarchical approach to modeling knowledge and common knowledge. International Journal of Game Theory, pp. 331–365.

Fagin, R. R., Halpern, J. Y., Moses, Y., & Vardi, M. Y. (1995). Reasoning About Knowledge. MIT Press.

Fudenberg, D., & Levine, D. K. (1998). The Theory of Learning in Games. MIT Press.

Fudenberg, D., & Tirole, J. (1991). Game Theory. MIT Press.

Gmytrasiewicz, P. J., & Durfee, E. H. (2000). Rational coordination in multi-agent environments. Autonomous Agents and Multiagent Systems Journal, (4), 319–350.

Harsanyi, J. C. (1967). Games with incomplete information played by 'Bayesian' players. Management Science, (3), 159–182.

Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, pp. 33–94.

Hauskrecht, M. (1997). Planning and control in stochastic domains with imperfect information. Ph.D. thesis, MIT.

Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Fifteenth International Conference on Machine Learning, pp. 242–250.

Kadane, J. B., & Larkey, P. D. (1982). Subjective probability and the theory of games. Management Science, (2), 113–120.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, (2), 99–134.

Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica, pp. 1231–1240.

Koller, D., & Milch, B. (2001). Multi-agent influence diagrams for representing and solving games. In Seventeenth International Joint Conference on Artificial Intelligence, pp. 1027–1034, Seattle, Washington.

Li, M., & Vitanyi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications. Springer.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, (1-4), 47–66.

Madani, O., Hanks, S., & Condon, A. (2003). On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 5–34.

Mertens, J.-F., & Zamir, S. (1985). Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory, 1–29.

Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 1–16.

Myerson, R. B. (1991). Game Theory: Analysis of Conflict. Harvard University Press.

Nachbar, J. H., & Zame, W. R. (1996). Non-computable strategies and discounted repeated games. Economic Theory, 103–122.

Nair, R., Pynadath, D., Yokoo, M., Tambe, M., & Marsella, S. (2003). Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03).

Ooi, J. M., & Wornell, G. W. (1996). Decentralized control of a multiple access broadcast channel. In Proceedings of the 35th Conference on Decision and Control.

Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, (3), 441–450.

Russell, S., & Norvig, P. (2003). Artificial Intelligence: A Modern Approach (Second Edition). Prentice Hall.

Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, pp. 1071–1088.

Stokey, N. L., & Lucas, R. E. (1989). Recursive Methods in Economic Dynamics. Harvard University Press.