A Unified Approach to Dynamic Decision Problems with Asymmetric Information - Part I: Non-Strategic Agents
Hamidreza Tavafoghi, Yi Ouyang, and Demosthenis Teneketzis
Abstract
We study a general class of dynamic multi-agent decision problems with asymmetric information and non-strategic agents, which includes dynamic teams as a special case. When agents are non-strategic, an agent's strategy is known to the other agents. Nevertheless, the agents' strategy choices and beliefs are interdependent over time, a phenomenon known as signaling. We introduce notions of sufficient private information and sufficient common information that effectively compress the agents' information in a mutually consistent manner. Based on these notions of sufficient information, we propose an information state for each agent that is sufficient for decision making purposes. We present instances of dynamic multi-agent decision problems where we can determine an information state with a time-invariant domain for each agent. Furthermore, we present a generalization of the policy-independence property of beliefs in Partially Observed Markov Decision Processes (POMDPs) to dynamic multi-agent decision problems. Within the context of dynamic teams with asymmetric information, the proposed set of information states leads to a sequential decomposition that decouples the interdependence between the agents' strategies and beliefs over time, and enables us to formulate a dynamic program to determine a globally optimal policy via backward induction.
A preliminary version of this paper will appear in the Proceedings of the 57th IEEE Conference on Decision and Control (CDC), Miami Beach, FL, December 2018 [1]. H. Tavafoghi is with the Department of Mechanical Engineering at the University of California, Berkeley (e-mail: [email protected]). Y. Ouyang is with Preferred Networks America, Inc. (e-mail: [email protected]). D. Teneketzis is with the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor (e-mail: [email protected]). This work was supported in part by NSF grants CNS-1238962 and CCF-1111061, ARO-MURI grant W911NF-13-1-0421, and ARO grant W911NF-17-1-0232.
November 23, 2018 DRAFT

I. INTRODUCTION
A. Background and Motivation
Dynamic multi-agent decision problems with asymmetric information have been used to model many situations arising in engineering, economic, and socio-technological applications. In these applications many decision makers/agents interact with each other as well as with a dynamic system. They make private imperfect observations over time, and influence the evolution of the dynamic system through their actions, which are determined by their strategies. An agent's strategy is defined as a decision rule that the agent uses to choose his action at each time based on his realized information at that time.

In this paper, we study a general class of dynamic decision problems with non-strategic agents. We say an agent is non-strategic if his strategy (not his specific action) is known to the other agents. In a companion paper [2] we study dynamic decision problems with strategic agents, where an agent's strategy is his private information and is not known to the other agents.

We consider an environment with controlled Markovian dynamics, where, given the agents' actions at every time, the system state at the next time is a stochastic function of the current system state. The instantaneous utility of each agent depends on the agents' joint actions as well as the system state. At every time, each agent makes a private noisy observation that depends on the current system state and the past actions of all agents in the system. Therefore, agents have asymmetric and imperfect information about the system history. Moreover, each agent's information depends on the other agents' past actions and strategies; this phenomenon is known as signaling in the control theory literature.
In such problems, the agents' decisions and information are coupled and interdependent over time because (i) an agent's utility depends on the other agents' actions, (ii) the evolution of the system state depends, in general, on all the agents' actions, (iii) each agent has imperfect and asymmetric information about the system history, and (iv) at every time an agent's information depends, in general, on all agents' (including his own) past actions and strategies.

There are two main challenges in the study of dynamic multi-agent decision problems with asymmetric information. First, because of the coupling and interdependence among the agents' decisions and information over time, we need to determine the agents' strategies simultaneously for all times. Second, as the agents acquire more information over time, the domains of their strategies grow.
In this paper, we propose a general approach for the study of dynamic decision problems with non-strategic agents and address these two challenges. We propose the notion of sufficient information and provide a set of conditions sufficient to characterize a compression of the agents' private and common information in a mutually consistent manner over time. We show that such a compression results in an information state for each agent's decision making problem. We show that restriction to the set of strategies based on this information state entails no loss of generality in dynamic decision problems with non-strategic agents.

We identify specific instances of dynamic decision problems where we can discover a set of information states for the agents that have time-invariant domains. Within the context of dynamic teams, we further demonstrate that the notion of sufficient information leads to a sequential decomposition of dynamic teams. This sequential decomposition results in a dynamic program the solution of which determines the agents' globally optimal strategies.

B. Related Literature
Partially Observed Markov Decision Processes (POMDPs), i.e. centralized stochastic control problems, present the simplest form of dynamic decision problems, with a single agent [3], [4]. To analyze and identify properties of optimal strategies in POMDPs, the notion of information state is introduced as the agent's belief about the current system state conditioned on his information history. The information state provides a way to compress the agent's information over time that is sufficient for decision-making purposes. When the agent has perfect recall, this information state is independent of the agent's strategies over time; this result is known as the policy-independence property of beliefs [3].

Dynamic multi-agent decision problems with non-strategic agents are considerably more difficult than their centralized counterparts. This is because, due to signaling, they are (in general) non-convex functional optimization problems (see [5]–[8]). The difficulties present in these problems were first illustrated by Witsenhausen [9], who showed that in a simple dynamic team problem with Gaussian primitive random variables and quadratic cost function where signaling occurs, linear strategies are suboptimal (contrary to the corresponding centralized problem, where linear strategies are optimal). Subsequently, many researchers investigated control problems with various specific information structures such as: partially nested ([10]–[15] and references therein), stochastic nested [16], randomized partially nested [17], delayed sharing ([11], [18]–[20] and references therein), information structures possessing the i-partition property
or the s-partition property [21], the quadratic invariance property [22], and the substitutability property [23].

Currently, there are three approaches to the analysis of dynamic multi-agent decision problems with non-strategic agents: the agent-by-agent approach [24], the designer's approach [25], and the common information approach [26]. We provide a brief discussion of these approaches here. We discuss them in detail in Section VI-B, where we compare them with the sufficient information approach we present in this paper and show that our approach is distinctly different from them. The agent-by-agent approach [24] is an iterative method. At each iteration, we pick an agent, fix the strategies of all agents except that agent, determine the best response for that agent, and update his strategy accordingly. We proceed in a round-robin fashion among the agents until a fixed point is reached, that is, until no agent can improve his performance by unilaterally changing his strategy. The designer's approach [25] considers the decision problem from the point of view of a designer who knows the system model and the probability distribution of the primitive random variables, and chooses the control strategies for all agents without having any information about the realization of the primitive random variables. The common information approach [26] assumes that at each time all agents possess private information and share some common information; it uses the common information to coordinate the agents' strategies sequentially over time.
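The POMDP information state recalled above is propagated by a Bayes filter that consumes only the realized action and observation, never the policy that produced the action; this is the computational content of the policy-independence property. A minimal numerical sketch (the kernels and numbers below are illustrative assumptions, not taken from [3], [4]):

```python
import numpy as np

def belief_update(pi, a, y, P, O):
    """One Bayes-filter step for a finite POMDP.

    pi : current belief over states, shape (S,)
    a  : realized action index
    y  : realized observation index
    P  : transition kernels, P[a][x, x'] = Prob(X' = x' | X = x, a)
    O  : observation kernels, O[a][x', y] = Prob(Y = y | X' = x', a)
    """
    pred = pi @ P[a]               # predict the next state given the action
    joint = pred * O[a][:, y]      # weight by the likelihood of the observation
    return joint / joint.sum()     # normalize: the updated belief

# Toy instance: 2 states, 1 action, 2 observations (hypothetical numbers).
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]])}
pi0 = np.array([0.5, 0.5])
pi1 = belief_update(pi0, a=0, y=1, P=P, O=O)
# The update uses only (pi, a, y); the strategy that generated the action
# never enters, which is the policy-independence property of beliefs.
```

Note that with many agents this simplicity is lost: as discussed in Section III, an agent's belief then depends on the other agents' past strategies.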
C. Contribution
We develop a general methodology for the study and analysis of dynamic decision problems with asymmetric information and non-strategic agents. Our model includes problems with non-classical information structures [19] where signaling is present. We propose an approach that effectively compresses the agents' private and common information in a mutually consistent manner. As a result, we offer a set of information states for the agents which are sufficient for decision making purposes. We characterize special instances where we can identify an information state with a time-invariant domain. Based on the proposed information state, we provide a sequential decomposition of dynamic teams over time. We show that the methodology developed in this paper generalizes the existing results for dynamic teams with non-classical information structure. Our results in this paper, along with those appearing in the companion paper [2], present a set of information states sufficient for decision making in strategic and non-strategic
settings. Therefore, we provide a unified approach to decision making problems that can be used to study dynamic games and dynamic teams, as well as dynamic games among teams of agents.
D. Organization
The rest of the paper is organized as follows. In Section II, we describe the model and present a few examples. In Section III, we discuss the main challenges that are present in dynamic multi-agent decision problems with non-strategic agents. We present the sufficient information approach in Section IV. We present the main results of the paper in Section V. We discuss an open problem associated with the sufficient information approach in Section VI-A. In Section VI-B, we compare the sufficient information approach with the existing approaches in the literature. We provide a generalization of the sufficient information approach in Section VII. We present an extension of our results to infinite-horizon dynamic multi-agent decision problems with non-strategic agents in Section VIII. We conclude in Section IX. The proofs of all theorems and lemmas appear in the Appendix.
Notation
Random variables are denoted by upper case letters and their realizations by the corresponding lower case letters. In general, subscripts are used as time indices while superscripts are used to index agents. For $t_1 \le t_2$, $X_{t_1:t_2}$ (resp. $f_{t_1:t_2}(\cdot)$) is shorthand for the random variables $(X_{t_1}, X_{t_1+1}, \ldots, X_{t_2})$ (resp. the functions $(f_{t_1}(\cdot), \ldots, f_{t_2}(\cdot))$). When we consider a sequence of random variables (resp. functions) for all times, we drop the subscript and use $X$ to denote $X_{1:T}$ (resp. $f(\cdot)$ to denote $f_{1:T}(\cdot)$). For random variables $X^1_t, \ldots, X^N_t$ (resp. functions $f^1_t(\cdot), \ldots, f^N_t(\cdot)$), we use $X_t := (X^1_t, \ldots, X^N_t)$ (resp. $f_t(\cdot) := (f^1_t(\cdot), \ldots, f^N_t(\cdot))$) to denote the vector of the set of random variables (resp. functions) at $t$, and $X^{-n}_t := (X^1_t, \ldots, X^{n-1}_t, X^{n+1}_t, \ldots, X^N_t)$ (resp. $f^{-n}_t(\cdot) := (f^1_t(\cdot), \ldots, f^{n-1}_t(\cdot), f^{n+1}_t(\cdot), \ldots, f^N_t(\cdot))$) to denote all random variables (resp. functions) at $t$ except that of the agent indexed by $n$. $\mathbb{P}(\cdot)$ and $\mathbb{E}(\cdot)$ denote the probability of an event and the expectation of a random variable, respectively. For a set $\mathcal{X}$, $\Delta(\mathcal{X})$ denotes the set of all beliefs/distributions on $\mathcal{X}$. For random variables $X, Y$ with realizations $x, y$, $\mathbb{P}(x|y) := \mathbb{P}(X = x | Y = y)$ and $\mathbb{E}(X|y) := \mathbb{E}(X | Y = y)$. For a strategy $g$ and a belief (probability distribution) $\pi$, we use $\mathbb{P}^{g}_{\pi}(\cdot)$ (resp. $\mathbb{E}^{g}_{\pi}(\cdot)$) to indicate that the probability (resp. expectation) depends on the choice of $g$ and $\pi$. We use $\mathbb{1}\{X = x\}$ to denote the indicator function of the event $X = x$. For sets $A$ and $B$, $A \setminus B$ denotes all elements of $A$ that are not in $B$. For random variables $X$ and $Y$, we write $X \overset{dist.}{=} Y$ when $X$ and $Y$ have identical probability distributions.
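The indexing conventions above map directly onto array slicing; a toy illustration (variable names hypothetical; Python arrays are 0-indexed, so time $t$ lives at index $t-1$ and agent $n$ at index $n-1$):

```python
import numpy as np

T, N = 5, 3
X = np.arange(T * N).reshape(T, N)   # X[t-1, n-1] plays the role of X^n_t

X_1to3 = X[0:3]                      # X_{1:3} = (X_1, X_2, X_3)
X_t = X[1]                           # X_2 = (X^1_2, ..., X^N_2): all agents at t = 2
n = 2
X_minus_n = np.delete(X[1], n - 1)   # X^{-n}_2: all agents at t = 2 except agent n
```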
II. MODEL
1) System dynamics:
Consider $N$ non-strategic agents who live in a dynamic Markovian world over a horizon $\mathcal{T} := \{1, 2, \ldots, T\}$, $T < \infty$. Let $X_t \in \mathcal{X}_t$ denote the state of the world at $t \in \mathcal{T}$. At time $t$, each agent, indexed by $i \in \mathcal{N} := \{1, 2, \ldots, N\}$, chooses an action $a^i_t \in \mathcal{A}^i_t$, where $\mathcal{A}^i_t$ denotes the set of actions available to him at $t$. Given the collective action profile $A_t := (A^1_t, \ldots, A^N_t)$, the state of the world evolves according to the stochastic dynamic equation
$$X_{t+1} = f_t(X_t, A_t, W^x_t), \quad (1)$$
where $W^x_{1:T-1}$ is a sequence of independent random variables. The initial state $X_1$ is a random variable that has a probability distribution $\eta \in \Delta(\mathcal{X}_1)$ with full support. At every time $t \in \mathcal{T}$, before taking an action, agent $i$ receives a noisy private observation $Y^i_t \in \mathcal{Y}^i_t$ of the current state of the world $X_t$ and the action profile $A_{t-1}$, given by
$$Y^i_t = O^i_t(X_t, A_{t-1}, W^i_t), \quad (2)$$
where $W^i_{1:T}$, $i \in \mathcal{N}$, are sequences of independent random variables. Moreover, at every $t \in \mathcal{T}$, all agents receive a common observation $Z_t \in \mathcal{Z}_t$ of the current state of the world $X_t$ and the action profile $A_{t-1}$, given by
$$Z_t = O^c_t(X_t, A_{t-1}, W^c_t), \quad (3)$$
where $W^c_{1:T}$ is a sequence of independent random variables. We note that the agents' actions $A_{t-1}$ are commonly observable at $t$ if $A_{t-1} \subseteq Z_t$. We assume that the random variables $X_1$, $W^x_{1:T-1}$, $W^c_{1:T}$, and $W^i_{1:T}$, $i \in \mathcal{N}$, are mutually independent.
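A toy instance of the dynamics (1) and observation models (2)-(3) can be simulated directly; the kernels, placeholder strategies, and parameters below are illustrative assumptions, not part of the model:

```python
import random

rng = random.Random(0)
STATES = (0, 1)

def f(x, a, rng):
    # X_{t+1} = f_t(X_t, A_t, W^x_t): flip the binary state with a
    # probability that depends on whether the two agents' actions agree.
    flip = 0.3 if a[0] == a[1] else 0.6
    return 1 - x if rng.random() < flip else x

def O_private(x, a_prev, rng):
    # Y^i_t = O^i_t(X_t, A_{t-1}, W^i_t): a noisy reading of the state.
    return x if rng.random() < 0.8 else 1 - x

def O_common(x, a_prev, rng):
    # Z_t = O^c_t(X_t, A_{t-1}, W^c_t): here Z_t reveals A_{t-1},
    # the special case in which actions are commonly observed.
    return a_prev

x = rng.choice(STATES)            # X_1 drawn from a full-support eta
a_prev = (0, 0)
trajectory = []
for t in range(1, 6):
    y = tuple(O_private(x, a_prev, rng) for _ in range(2))
    z = O_common(x, a_prev, rng)
    a = y                          # placeholder strategies: act on the last observation
    trajectory.append((x, y, z, a))
    x = f(x, a, rng)
    a_prev = a
```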
2) Information structure:
Let $H_t \in \mathcal{H}_t$ denote the aggregate information of all agents at time $t$. Assuming that agents have perfect recall, we have $H_t = \{Z_{1:t}, Y^{\mathcal{N}}_{1:t}, A^{\mathcal{N}}_{1:t-1}\}$, i.e. $H_t$ denotes the set of all agents' past observations and actions. The set of all possible realizations of the agents' aggregate information is given by $\mathcal{H}_t := \prod_{\tau \le t} \mathcal{Z}_\tau \times \prod_{i \in \mathcal{N}} \prod_{\tau \le t} \mathcal{Y}^i_\tau \times \prod_{i \in \mathcal{N}} \prod_{\tau < t} \mathcal{A}^i_\tau$. At each time $t$, the agents' information consists of a common part and a private part: $C_t \in \mathcal{C}_t$ denotes the information commonly known to all agents at $t$, and $P^i_t \in \mathcal{P}^i_t$ denotes agent $i$'s private information at $t$.

3) Strategies and Utilities:

Let $H^i_t := \{C_t, P^i_t\} \in \mathcal{H}^i_t$ denote the information available to agent $i$ at $t$, where $\mathcal{H}^i_t$ denotes the set of all possible realizations of agent $i$'s information at $t$. Agent $i$'s strategy $g^i := \{g^i_t, t \in \mathcal{T}\}$ is defined as a sequence of mappings $g^i_t : \mathcal{H}^i_t \to \Delta(\mathcal{A}^i_t)$, $t \in \mathcal{T}$, that determine agent $i$'s action $A^i_t$ for every realization $h^i_t \in \mathcal{H}^i_t$ of his history at $t \in \mathcal{T}$. Agent $i$'s instantaneous utility at $t$ depends on the state of the world $X_t$ and the collective action profile $A_t$, and is given by $u^i_t(X_t, A_t)$. Therefore, agent $i$'s total utility over the horizon $\mathcal{T}$ is given by
$$U^i(X_{1:T}, A_{1:T}) := \sum_{t \in \mathcal{T}} u^i_t(X_t, A_t). \quad (4)$$
We assume that agents are non-strategic. That is, each agent $i$'s, $i \in \mathcal{N}$, strategy choice $g^i$ is known to the other agents. We note that these non-strategic agents may have different utilities over time. Therefore, the model includes a team of agents sharing the same utility (see Section V) as well as agents with general non-identical utilities. In [2] we build on our results in this paper to study dynamic decision problems with strategic agents, where an agent may deviate privately from the commonly believed strategy and gain by misleading the other agents.

To avoid measure-theoretic technical difficulties, and for clarity and convenience of exposition, we assume that all random variables take values in finite sets.

Assumption 1 (Finite game). The sets $\mathcal{X}_t$, $\mathcal{Z}_t$, $\mathcal{Y}^i_t$, $\mathcal{A}^i_t$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, are finite.
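Under Assumption 1 all sets are finite, but the information sets still grow with $t$. A back-of-the-envelope count, assuming (hypothetically) that agent $i$'s information is the full-recall history $h^i_t = \{z_{1:t}, y^i_{1:t}, a^i_{1:t-1}\}$ with toy cardinalities, shows how quickly the domain of $g^i_t$, and with it the number of pure decision rules, blows up:

```python
# Toy cardinalities per stage: |Z| common observations, |Y| private
# observations, |A| actions (all hypothetical illustrative numbers).
Z, Y, A = 2, 2, 2

def num_histories(t):
    # |H^i_t| = |Z|^t * |Y|^t * |A|^(t-1) for the full-recall history.
    return Z**t * Y**t * A**(t - 1)

for t in (1, 2, 3):
    h = num_histories(t)
    # A**h = number of pure decision rules g^i_t : H^i_t -> A^i_t at stage t.
    print(f"t={t}: |H^i_t|={h}, pure stage rules={A**h:.3e}")
```

Already at $t = 3$ there are $2^{256}$ pure stage rules in this tiny example, which is why the information compression developed in Section IV matters.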
Special Cases: We present several instances of dynamic decision problems with asymmetric information that are special cases of the general model described above.

1) Real-time source coding-decoding [27]: Consider a data source that generates a random sequence $\{X_1, \ldots, X_T\}$ that is $k$-th order Markov, i.e. for every sequence of realizations $x_{1:T}$, $\mathbb{P}\{X_{t+k:T} = x_{t+k:T} \,|\, x_{1:t+k-1}\} = \mathbb{P}\{X_{t+k:T} = x_{t+k:T} \,|\, x_{t:t+k-1}\}$ for $t \le T - k$. There exists an encoder (agent 1) who observes $X_t$ at every time $t$; the encoder has perfect recall. At every time $t$, based on his available data $\{X_1, \ldots, X_t\}$, the encoder transmits a signal $M_t \in \mathcal{M}_t$ through a noiseless channel to a decoder (agent 2), where $\mathcal{M}_t$ denotes the transmission alphabet. At the receiving end, at every time $t$, the decoder wants to estimate the value of $X_{t-1-\delta}$ (with delay $\delta$) as $\hat{X}_{t-1-\delta}$ based on his available data $M_{1:t-1}$; we assume that the decoder has perfect recall. The encoder and decoder choose their joint coding-decoding policy so as to minimize the expected total distortion given by $\sum_{t=2+\delta}^{T} d_t(X_{t-1-\delta}, \hat{X}_{t-1-\delta})$, where $d_t(\cdot, \cdot)$ denotes the instantaneous distortion function. To capture the above-described model within the context of our model, we need to define an augmented system state $\tilde{X}_t$ that includes the last $\max(k, \delta+1)$ state realizations, $\tilde{X}_t := \{X_{t-\max(k,\delta+1)+1}, \ldots, X_t\}$. Moreover, the encoder's (agent 1's) observation is given by $Y^1_t = O^1_t(\tilde{X}_t, A_{t-1}) = X_t$ and the decoder's (agent 2's) observation is given by $Y^2_t = O^2_t(\tilde{X}_t, A_{t-1}) = M_{t-1}$, where $(A^1_t, A^2_t) = (M_t, \hat{X}_{t-1-\delta})$. The encoder's and decoder's instantaneous utility is given by the distortion function $u^{team}_t(\tilde{X}_t, A_t) = d_t(X_{t-1-\delta}, \hat{X}_{t-1-\delta})$.

2) Delayed sharing information structure [18]–[20], [28]: Consider an $N$-agent decision problem where agents observe each others' observations and actions with a $d$-step delay.
We note that in our model we assume that the agents' common observation $Z_t$ at $t$ is only a function of $X_t$ and $A_{t-1}$. Therefore, to describe the decision problem with delayed sharing information structure within the context of our model, we need to augment the state space to include the agents' last $d$ observations and actions as part of the augmented state. Define $\tilde{X}_t := \{X_t, M^1_t, M^2_t, \ldots, M^d_t\}$ as the augmented system state, where $M^i_t := \{A_{t-i}, Y_{t-i}\} \in \mathcal{A}_{t-i} \times \mathcal{Y}_{t-i}$, $i \in \{1, \ldots, d\}$; that is, $M^i_t$ serves as a temporal memory for the agents' observations $Y_{t-i}$ and actions $A_{t-i}$ at $t-i$. Then, we have $\tilde{X}_{t+1} = \{X_{t+1}, M^1_{t+1}, M^2_{t+1}, \ldots, M^d_{t+1}\} = \{f_t(X_t, A_t, W^x_t), (Y_t, A_t), M^1_t, \ldots, M^{d-1}_t\}$ and $Z_t = \{M^d_t\} = \{Y_{t-d}, A_{t-d}\}$.

3) Real-time multi-terminal communication [29]: Consider a real-time communication system with two encoders (agents 1 and 2) and one receiver (agent 3). The two encoders make distinct observations $X^1_t$ and $X^2_t$ of a Markov source. The encoders' observations are conditionally independent Markov chains. That is, there is an unobserved random variable $R$ such that $\mathbb{P}\{X^1_1, X^2_1, R\} = \mathbb{P}\{X^1_1 | R\}\,\mathbb{P}\{X^2_1 | R\}\,\mathbb{P}\{R\}$ and $\mathbb{P}\{X^1_{t+1}, X^2_{t+1} \,|\, X^1_{1:t}, X^2_{1:t}, R\} = \mathbb{P}\{X^1_{t+1} | X^1_t, R\}\,\mathbb{P}\{X^2_{t+1} | X^2_t, R\}$.

[Figure: block diagram of the two-encoder, one-receiver real-time communication system: Markov source $(X^1_t, X^2_t)$, Encoders 1 and 2 producing symbols $M^1_t, M^2_t$, noisy Channels 1 and 2 with outputs $Y^1_t, Y^2_t$, and the Receiver producing the estimate $\hat{X}_t$.]

Each encoder encodes, in real time, its observations into a sequence of discrete symbols and sends it through a memoryless noisy channel characterized by a transition matrix $Q^i_t(\cdot|\cdot)$, $i = 1, 2$. The receiver wants to construct, in real time, an estimate $\hat{X}_t$ of the state of the Markov source based on the channels' outputs $Y^1_{1:t}, Y^2_{1:t}$. All agents have the same instantaneous utility, given by a distortion function $d_t(X_t, \hat{X}_t)$.
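The augmented state in the delayed-sharing example (example 2 above) is a shift register over the last $d$ action-observation pairs; a minimal sketch, with hypothetical names:

```python
def step_delayed_sharing(x_next, y_t, a_t, memory):
    """One update of the augmented state ~X_t = {X_t, M^1_t, ..., M^d_t}.

    memory = [M^1_t, ..., M^d_t] with M^i_t = (A_{t-i}, Y_{t-i}).
    The freshest pair (A_t, Y_t) enters as the new M^1, every entry shifts
    down one slot, and the oldest entry M^d_t is dropped; the new oldest
    entry M^d_{t+1} is the common observation Z_{t+1} = (A_{t+1-d}, Y_{t+1-d}).
    """
    memory_next = [(a_t, y_t)] + memory[:-1]   # shift in (A_t, Y_t), drop M^d_t
    z_next = memory_next[-1]                   # Z_{t+1} = M^d_{t+1}
    return (x_next, memory_next), z_next

# d = 2: the memory holds the last two action-observation pairs.
memory = [("a2", "y2"), ("a1", "y1")]          # (M^1_3, M^2_3) at t = 3
aug, z = step_delayed_sharing("x4", "y3", "a3", memory)
# z == ("a2", "y2"): the pair from t = 2 becomes common knowledge at t = 4.
```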
4) Optimal remote and local controllers [30], [31]: Consider a decentralized control problem for a Markovian plant with two controllers, a local controller (agent 1) and a remote controller (agent 2).

[Figure: block diagram — the plant $f_t(X_t, A^1_t, A^2_t)$, the local controller (agent 1) observing $X_t$ perfectly, and the remote controller (agent 2) receiving $Y_t$ over a packet-drop channel.]

The local controller observes perfectly the state $X_t$ of the Markov chain, and sends his observation through a packet-drop channel to the remote controller. The transmission is successful, i.e. $Y_t = X_t$, with probability $p > 0$, and is not successful, i.e. $Y_t = \emptyset$, with probability $1 - p \ge 0$. We assume that the local controller receives an acknowledgment every time the transmission is successful. The controllers' joint instantaneous utility is given by $u^{team}_t(X_t, A^1_t, A^2_t)$.

III. STRATEGIES AND BELIEFS

In a dynamic decision problem with asymmetric information agents have private information about the evolution of the system, and they do not observe the complete history $\{H_t, X_t\}$, $t \in \mathcal{T}$. Therefore, at every time $t \in \mathcal{T}$, each agent, say agent $i \in \mathcal{N}$, needs to form (i) an appraisal about the current state of the system $X_t$ and the other agents' information $H^{-i}_t$ (appraisal about the history), and (ii) an appraisal about how other agents will play in the future (appraisal about the future), so as to evaluate the performance of his strategy choices.

When agents are non-strategic, the agents' strategies $g^{\mathcal{N}}_{1:T}$ are known to all agents. Therefore, agent $i \in \mathcal{N}$ can form these appraisals by using his private information $H^i_t$ along with the commonly known strategies $g^{-i}$. Specifically, agent $i$ can utilize his own information $H^i_t$ at $t \in \mathcal{T}$, along with (i) the past strategies $g_{1:t-1}$ and (ii) the future strategies $g_{t:T}$, to form these appraisals about the history and the future of the overall system, respectively.
As a result, the outcome of decision problems with non-strategic agents can be fully characterized by the agents' strategy profile $g$.

However, we need to know the entire strategy profile $g$, for all agents and at all times, to form these appraisals so as to evaluate the performance of an arbitrary strategy $g^i_t$, at any time $t \in \mathcal{T}$ and for any agent $i \in \mathcal{N}$. Therefore, we must work with the strategy profile $g$ as a whole, irrespective of the length of the time horizon $T$. Consequently, the computational complexity of determining a strategy profile that satisfies certain conditions (e.g. an optimal strategy profile in teams) grows doubly exponentially in $|\mathcal{T}|$, since the domain of the agents' strategies (i.e. $|\mathcal{H}^i_t|$) and the number of temporally interdependent decision problems (one for each time instance) grow with $|\mathcal{T}|$. As a result, the analysis of such decision problems is very challenging in general [32].

An alternative conceptual approach to the analysis of decision problems is to define a belief system $\mu$ along with the strategy profile $g$. For every agent $i \in \mathcal{N}$, at every time $t \in \mathcal{T}$, define $\mu^i_t(h^i_t)$ as agent $i$'s belief about $\{X_t, P^{-i}_t\}$ conditioned on the realization of $h^i_t$, that is, $\mu^i_t(h^i_t)(x_t, p^{-i}_t) := \mathbb{P}^{g_{1:t-1}}\{X_t = x_t, P^{-i}_t = p^{-i}_t \,|\, h^i_t\}$. The belief $\mu^i_t$ provides an intermediate instrument that encapsulates agent $i$'s appraisal about the past. Therefore, agent $i$ can evaluate the performance of any action $a^i_t$ using only the belief $\mu^i_t(h^i_t)$ along with the future strategy profile $g_{t:T}$. However, the belief $\mu^i_t(h^i_t)(x_t, p^{-i}_t)$ depends on $g_{1:t-1}$ in general, since the probability distribution $\mathbb{P}^{g_{1:t-1}}\{X_t = x_t, P^{-i}_t = p^{-i}_t \,|\, h^i_t\}$ depends on $g_{1:t-1}$.
Therefore, the introduction of a belief system offers an equivalent problem formulation that does not necessarily break the inter-temporal dependence between $g_{1:t-1}$ and $g_{t:T}$, and does not simplify the analysis of decision problems.

Nevertheless, the definition of a belief system has been shown to be suitable for the analysis of single-agent decision making problems (POMDPs), for the following reasons. First, in POMDPs, under perfect recall, the probability distribution $\mathbb{P}^{g_{1:t-1}}\{X_t = x_t \,|\, h_t\}$ is independent of $g_{1:t-1}$; this is known as the policy-independence property of beliefs in stochastic control. Second, the complexity of the belief function does not grow over time, since at every time $t$ the agent only needs to form a belief about $X_t$, which has a time-invariant domain. As a result, we can sequentially decompose the problem over time into a sequence of static decision problems with time-invariant complexity; such a decomposition leads to a dynamic program. At each stage $t \in \mathcal{T}$ of the dynamic program, we specify $g_t$ by determining an action for each realization of the belief $\mu_t(\cdot)$, fixing the future strategies $g_{t+1:T}$. Therefore, the computational complexity of the analysis is reduced from being exponential in $T$ to linear in $T$.

(Footnote: We discuss decision problems with strategic agents in the companion paper [2]. When agents are strategic, each agent may have an incentive to deviate at any time from the strategy the other agents commonly believe he uses, if such a deviation is profitable to him; see [2] for more discussion.)

Unfortunately, the above approach for POMDPs does not generalize to decision problems with many agents, for three reasons. First, with many agents, there currently exists no information state in the literature that, for each agent, provides a compression of the agent's information, in a mutually consistent manner among the agents, that is sufficient for decision making purposes.
Therefore, an agent's, say agent $i$'s, strategy $g^i_t$ has a growing domain over time. Second, at every time $t \in \mathcal{T}$, each agent $i \in \mathcal{N}$ needs to form a belief about the system state $X_t$ as well as the other agents' private information $P^{-i}_t$, which has a growing domain. Therefore the complexity of the belief functions grows over time. Third, in decision problems with many agents, the policy-independence property of beliefs does not hold in general, and the agents' beliefs at every time $t$ depend on the past strategy profile $g_{1:t-1}$. Therefore, the agents' beliefs $\mu^{\mathcal{N}}_t(\cdot)$ are correlated with one another. This correlation depends on $g_{1:t-1}$, and thus, it is not known a priori. Consequently, if we follow an approach similar to that of POMDPs to sequentially decompose the problem, we need to solve the decision problem at every stage for every arbitrary correlation among the agents' belief functions, and such a problem is not tractable. Hence, the methodology proposed for the study of POMDPs is not directly applicable to decision problems with many agents and non-classical information structures.

In the sequel, we propose a notion of sufficient private information and sufficient common information as a mutually consistent compression of the agents' information for decision making purposes. Thereby, we (partially) address the first two problems above, on the growing domains of the agents' beliefs and strategies. We provide instances of decision problems where we can discover a time-invariant information state for each agent. We then utilize the agents' sufficient common information as a coordination instrument, and thus capture the implicit correlation among the agents' beliefs over time.
Accordingly, we present a sequential decomposition of the original decision problems such that at every stage the complexity of the decision problem is similar to that of a static multi-agent decision problem, and the size of the state variable at every stage is proportional to the dimension of the sufficient private information; thus, we (partially) address the third problem discussed above.

(Footnote: Alternatively, one can consider arbitrary correlations among the agents' information rather than their beliefs. This is the main idea that underlies the designer's approach proposed by Witsenhausen [25]. Please see Section VI-B for more discussion.)

IV. SUFFICIENT INFORMATION

We present the sufficient information approach and characterize an information state that results from compressing the agents' private and common information in a mutually consistent manner. Therefore, we introduce a class of strategy choices that are simpler than general strategies, as they require agents to keep track of only a compressed version of their information over time. We proceed as follows. In Section IV-A we provide conditions sufficient to determine the subset of private information an agent needs to keep track of over time for decision making purposes. In Section IV-B, we introduce the notion of sufficient common information as a compressed version of the agents' common information that, along with sufficient private information, provides an information state for each agent. We then show, in Section V, that this compression of the agents' private and common information provides a sufficient statistic in dynamic decision problems with non-strategic agents. In Section VII, we provide a generalization of the sufficient information approach presented here.

A. Sufficient Private Information

The key ideas for compressing an agent's private information appear in Definitions 1 and 2 below.
To motivate these definitions, we first consider the decision problem with a single agent, that is, a Partially Observed Markov Decision Process (POMDP), which is a special case of the model described in Section II where $N = 1$, $H_t = P_t$, and $C_t = \emptyset$ for all $t \in \mathcal{T}$.

In a POMDP, the agent's belief about the system state $X_t$ conditioned on his history realization $h_t$ is an information state. We highlight the three main properties that underlie the definition of the information state in POMDPs (see [33], [34]): (1) the information state can be updated recursively, that is, at any time $t$ the information state at $t$ can be written as a function of the information state at $t-1$ and the new information that becomes available at $t$; (2) the agent's belief about the information state at the next time, conditioned on the current information state and action, is independent of his information history; and (3) at any time $t$ and for any arbitrary action, the agent's expected instantaneous utility conditioned on the information state is independent of his information history.

We generalize the key properties of the information state for POMDPs, described above, to decision problems with many agents. We propose a set of conditions sufficient to compress the agents' private information in two steps. First, we consider a decision problem with many agents where there is no signaling among them. Motivated by the definition of the information state in POMDPs, we describe conditions sufficient to determine a compression of the agents' private information (Definition 1). Next, we build on Definition 1 as an intermediate conceptual step, and consider the case where agents are aware of possible signaling among them.
Accordingly, we present a set of conditions sufficient to determine a compression of the agents' private information in decision problems with many agents (Definition 2).

Therefore, we first characterize subsets of an agent's private information that are sufficient for the agent's decision making process when there is no signaling among the agents.

Definition 1 (Private payoff-relevant information). Let $P^{i,pr}_t = \bar{\zeta}^i_t(P^i_t, C_t)$ denote a private signal that agent $i \in \mathcal{N}$ forms at $t \in \mathcal{T}$ based on his private information $P^i_t$ and common information $C_t$. We say $P^{i,pr}_t$ is private payoff-relevant information for agent $i$ if, for every open-loop strategy profile $(A^{\mathcal{N}}_{1:T} = \hat{a}^{\mathcal{N}}_{1:T})$ and for all $t \in \mathcal{T}$,

(i) it can be updated recursively as $P^{i,pr}_t = \bar{\phi}^i_t(P^{i,pr}_{t-1}, H^i_t \setminus H^i_{t-1})$ if $t \neq 1$;

(ii) for all realizations $\{c_t, p^i_t\}$ it satisfies
$$\mathbb{P}^{(A^{\mathcal{N}}_{1:T} = \hat{a}^{\mathcal{N}}_{1:T})}\left\{p^{i,pr}_{t+1} \,\middle|\, p^i_t, c_t, a_t\right\} = \mathbb{P}^{(A^{\mathcal{N}}_{1:T} = \hat{a}^{\mathcal{N}}_{1:T})}\left\{p^{i,pr}_{t+1} \,\middle|\, p^{i,pr}_t, c_t, a_t\right\};$$

(iii) for all realizations $\{c_t, p^i_t\} \in \mathcal{C}_t \times \mathcal{P}^i_t$ such that $\mathbb{P}^{(A^{\mathcal{N}}_{1:T} = \hat{a}^{\mathcal{N}}_{1:T})}\{c_t, p^i_t\} > 0$,
$$\mathbb{E}^{(A^{\mathcal{N}}_{1:t-1} = \hat{a}^{\mathcal{N}}_{1:t-1})}\left\{u^i_t(X_t, A_t) \,\middle|\, c_t, p^i_t, a_t\right\} = \mathbb{E}^{(A^{-i}_{1:t-1} = \hat{a}^{-i}_{1:t-1})}\left\{u^i_t(X_t, A_t) \,\middle|\, c_t, p^{i,pr}_t, a_t\right\}.$$

By assuming that all other agents play open-loop strategies, we remove the interdependence between agents $-i$'s strategy choices and agent $i$'s information structure, and thus we eliminate signaling among the agents. Fixing the open-loop strategies of agents $-i$, agent $i$ faces a centralized stochastic control problem.
Definition 1 says that $P_t^{i,pr}$, $t \in \mathcal{T}$, is private payoff-relevant information for agent $i$ if (i) it can be recursively updated, (ii) $P_t^{i,pr}$ includes all information in $P_t^i$ that is relevant to $P_{t+1}^{i,pr}$, and (iii) agent $i$'s instantaneous conditional expected utility at any $t \in \mathcal{T}$ is only a function of $C_t$, $P_t^{i,pr}$, and his action $A_t^i$ at $t$. These three conditions are similar to properties (1)-(3) for an information state in a POMDP, but they concern only agent $i$'s private information $P_t^i$ instead of the collection $H_t^i = \{C_t, P_t^i\}$ of his private and common information.

While the definition of private payoff-relevant information suggests a possible way to compress the information required for an agent's decision making process, it assumes that other agents play open-loop strategies and do not utilize the information they acquire in real time for decision making purposes (i.e., no signaling). However, open-loop strategies are not in general optimal for agents $-i$. As a result, to evaluate the performance of any strategy choice $g^i$, agent $i$ also needs to form a belief about the information that other agents utilize to make decisions.

Definition 2 (Sufficient private information).
We say $S_t^i = \zeta_t^i(P_t^i, C_t; g_{1:t-1})$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, is sufficient private information for the agents if:

(i) it can be updated recursively as

$S_t^i = \phi_t^i(S_{t-1}^i, H_t^i \setminus H_{t-1}^i; g_{1:t-1})$ for $t \in \mathcal{T} \setminus \{1\}$; (5)

(ii) for any strategy profile $g$ and for all realizations $\{c_t, p_t, p_{t+1}, z_{t+1}, a_t\} \in \mathcal{C}_t \times \mathcal{P}_t \times \mathcal{P}_{t+1} \times \mathcal{Z}_{t+1} \times \mathcal{A}_t$ of positive probability,

$\mathbb{P}^{g_{1:t}}\{s_{t+1}, z_{t+1} \,|\, p_t, c_t, a_t\} = \mathbb{P}^{g_{1:t}}\{s_{t+1}, z_{t+1} \,|\, s_t, c_t, a_t\},$ (6)

where $s_\tau^{\mathcal{N}} = \zeta_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; g_{1:\tau-1})$ for $\tau \in \mathcal{T}$;

(iii) for every strategy profile $\tilde{g}$ of the form $\tilde{g} := \{\tilde{g}_t^i : \mathcal{S}_t^i \times \mathcal{C}_t \to \Delta(\mathcal{A}_t^i),\, i \in \mathcal{N},\, t \in \mathcal{T}\}$ and $a_t \in \mathcal{A}_t$, $t \in \mathcal{T}$,

$\mathbb{E}^{\tilde{g}_{1:t-1}}\{u_t^i(X_t, A_t) \,|\, c_t, p_t^i, a_t\} = \mathbb{E}^{\tilde{g}_{1:t-1}}\{u_t^i(X_t, A_t) \,|\, c_t, s_t^i, a_t\},$ (7)

for all realizations $\{c_t, p_t^i\} \in \mathcal{C}_t \times \mathcal{P}_t^i$ of positive probability, where $s_\tau^{\mathcal{N}} = \zeta_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; \tilde{g}_{1:\tau-1})$ for $\tau \in \mathcal{T}$;

(iv) given an arbitrary strategy profile $\tilde{g}$ of the form $\tilde{g} := \{\tilde{g}_t^i : \mathcal{S}_t^i \times \mathcal{C}_t \to \Delta(\mathcal{A}_t^i),\, i \in \mathcal{N},\, t \in \mathcal{T}\}$, $i \in \mathcal{N}$, and $t \in \mathcal{T}$,

$\mathbb{P}^{\tilde{g}_{1:t-1}}\{s_t^{-i} \,|\, p_t^i, c_t\} = \mathbb{P}^{\tilde{g}_{1:t-1}}\{s_t^{-i} \,|\, s_t^i, c_t\},$ (8)

for all realizations $\{c_t, p_t^i\} \in \mathcal{C}_t \times \mathcal{P}_t^i$ of positive probability, where $s_\tau^{\mathcal{N}} = \zeta_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; \tilde{g}_{1:\tau-1})$ for $\tau \in \mathcal{T}$.

We note that if we interpret a centralized control problem as a special case of our model where $N = 1$, $H_t = P_t$, and $C_t = \emptyset$ for all $t \in \mathcal{T}$, Definition 1 coincides with the definition of an information state for the single-agent decision problem. We would also like to point out that conditions (i)-(iii) can have many solutions, including the trivial solution $P_t^{i,pr} = P_t^i$.

There are four key differences between the definition of sufficient private information and that of private payoff-relevant information.
First, we allow the definition and the update rule of the sufficient information $S_t^i$ to depend on the agents' strategies $g_{1:t-1}$. Second, compared to part (ii) of Definition 1, part (ii) of Definition 2 requires that the sufficient information $S_t$ include all information relevant to the realization of $Z_{t+1}$ in addition to the information relevant to the realization of $S_{t+1}$. As we discuss further in Section VI, this is because when signaling occurs in a multi-agent decision problem, agents need to have a consistent view about future commonly observable events. Third, comparing part (iii) of Definition 2 to part (iii) of Definition 1, we note that the probability measures in Definition 2 depend on the strategy profile $g$ instead of the open-loop strategy profile $(A_{1:T}^{\mathcal{N}} = \hat{a}_{1:T}^{\mathcal{N}})$. Fourth, part (iv) of Definition 2 contains an additional condition requiring that agent $i$'s sufficient private information $S_t^i$ be rich enough so that he can form beliefs about agents $-i$'s sufficient private information $S_t^{-i}$; such a condition is absent in Definition 1.

In general, the notion of sufficient private information $S_t^{\mathcal{N}}$ is more restrictive than that of private payoff-relevant information $P_t^{\mathcal{N},pr}$. This is because $S_t^{\mathcal{N}}$, $t \in \mathcal{T}$, needs to satisfy the additional condition (iv), and furthermore, open-loop strategies are a strict subset of closed-loop strategies. Definition 2 provides (sufficient) conditions under which agents can compress their private information in a "mutually consistent" manner. We would like to point out that conditions (i)-(iv) of Definition 2 can have many solutions, including the trivial solution $S_t^i = P_t^i$.

B. Sufficient Common Information

Based on the characterization of sufficient private information, we present a statistic (compressed version) of the common information $C_t$ that agents need to keep track of over time for decision making purposes. Fix a choice of sufficient private information $S_t^{\mathcal{N}}$, $t \in \mathcal{T}$.
Define $\mathcal{S}_t^i$ to be the set of all possible realizations of $S_t^i$, and $\mathcal{S}_t := \prod_{i=1}^{N} \mathcal{S}_t^i$. Given the agents' strategy profile $g$, let $\gamma_t : \mathcal{C}_t \to \Delta(\mathcal{X}_t \times \mathcal{S}_t)$ denote a mapping that determines a conditional probability distribution over the system state $X_t$ and all the agents' sufficient private information $S_t$ conditioned on the common information $C_t$ at time $t$, given by

$\gamma_t(c_t)(x_t, s_t) = \mathbb{P}^{g_{1:t-1}}\{X_t = x_t, S_t = s_t \,|\, c_t\},$ (9)

for all $c_t \in \mathcal{C}_t$, $x_t \in \mathcal{X}_t$, $s_t \in \mathcal{S}_t$. (We do not discuss the possibility of finding a minimal set of sufficient private information in this paper, and leave it for future research, as such an investigation is beyond our scope here.)

We call the collection of mappings $\gamma := \{\gamma_t, t \in \mathcal{T}\}$ a sufficient information based belief system (SIB belief system). Note that $\gamma_t$ is only a function of the common information $C_t$, and thus, it is computable by all agents. Let $\Pi_t^\gamma := \gamma_t(C_t)$ denote the (random) common information based belief that agents hold under belief system $\gamma$ at $t$. We can interpret $\Pi_t^\gamma$ as the common belief that each agent holds about the system state $X_t$ and all the agents' (including his own) sufficient private information $S_t$ at time $t$. We call the SIB belief $\Pi_t$ sufficient common information for the agents. In the rest of the paper, we write $\Pi_t$ and drop the superscript $\gamma$ whenever such a simplification in notation is clear. Moreover, we use the terms sufficient common information and SIB belief interchangeably.

C. Sufficient Information based Strategy

The combination of sufficient private information $S_t^{\mathcal{N}}$ and sufficient common information (the SIB belief) $\Pi_t$ offers a mutually consistent compression of the agents' private and common information. Consider a class of strategies that are based on the information given by $(\Pi_t, S_t^i)$ for each agent $i \in \mathcal{N}$ at time $t \in \mathcal{T}$.
We call the mapping $\sigma_t^i : \Delta(\mathcal{X}_t \times \mathcal{S}_t) \times \mathcal{S}_t^i \to \Delta(\mathcal{A}_t^i)$ a Sufficient Information Based (SIB) strategy for agent $i$ at time $t$. A SIB strategy $\sigma_t^i$ determines a probability distribution for agent $i$'s action $A_t^i$ at time $t$ given his information $(\Pi_t, S_t^i)$. A SIB strategy is a strategy where agents only use the sufficient common information $\Pi_t = \gamma_t(C_t)$ (instead of the complete common information $C_t$) and the sufficient private information $S_t^i = \zeta_t^i(P_t^i, C_t; g_{1:t-1})$ (instead of the complete private information $P_t^i$). A collection of SIB strategies $\sigma := \{\sigma_t^i, i \in \mathcal{N}, t \in \mathcal{T}\}$ is called a SIB strategy profile. The set of SIB strategies is a subset of the general strategies defined in Section II, as we can define

$g_t^{(\sigma,\gamma),i}(h_t^i) := \sigma_t^i(\pi_t^\gamma, s_t^i) \quad \forall t \in \mathcal{T}.$ (10)

We note that, by Definition 2 and (9), the realizations $\pi_t$ and $s_t^{\mathcal{N}}$ at $t$ depend only on $g_{1:t-1}$. Therefore, the strategies $g_t^{(\sigma,\gamma),i}$ defined via (10) need to be determined iteratively as follows: for $t = 1$, $g_1^{(\sigma,\gamma),i}(h_1^i) = \sigma_1^i(\pi_1^\gamma, \zeta_1^i(P_1^i, C_1))$; for $t = 2$, $g_2^{(\sigma,\gamma),i}(h_2^i) = \sigma_2^i(\pi_2^\gamma, \zeta_2^i(P_2^i, C_2; g_1^{(\sigma,\gamma)}))$; $\ldots$; for $t = T$, $g_T^{(\sigma,\gamma),i}(h_T^i) = \sigma_T^i(\pi_T^\gamma, \zeta_T^i(P_T^i, C_T; g_{1:T-1}^{(\sigma,\gamma)}))$. Therefore, the strategy $g_t^{(\sigma,\gamma),i}$ is well-defined for all $t \in \mathcal{T}$ and $i \in \mathcal{N}$.

D. Sufficient Information based Update Rule

When the agents play a SIB strategy profile $\sigma$, it is possible to determine the SIB belief $\Pi_t$ recursively over time based on $\Pi_{t-1}$ and the new common information $Z_t$ via Bayes' rule. Let $\psi_t^{\sigma_{t-1}} : \Delta(\mathcal{X}_{t-1} \times \mathcal{S}_{t-1}) \times \mathcal{Z}_t \to \Delta(\mathcal{X}_t \times \mathcal{S}_t)$ describe such an update rule for $t \in \mathcal{T} \setminus \{1\}$, so that

$\Pi_t = \psi_t^{\sigma_{t-1}}(\Pi_{t-1}, Z_t).$ (11)

We note that the SIB update rule $\psi_t^{\sigma_{t-1}}$ depends on the SIB strategy profile $\sigma_{t-1}$ at $t-1$. In the rest of the paper, we drop the superscript $\sigma$ whenever such a simplification in notation is clear.
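Numerically, (11) is a Bayes-rule update of the common belief. A schematic sketch, where the joint kernel $K$ (a hypothetical stand-in) encodes the transition of $(X_t, S_t)$ and the generation of $Z_t$ induced by the dynamics and the SIB strategies $\sigma_{t-1}$:

```python
import numpy as np

def sib_update(pi, z, K):
    """Bayes update Pi_t = psi(Pi_{t-1}, Z_t), schematically.

    pi:        belief over joint points (x, s), shape (M,)
    K[j,k,z]:  P(next point k, common obs z | current point j),
               induced by the dynamics and sigma_{t-1}
    """
    joint = pi @ K[:, :, z]       # P(next point, Z_t = z | common info)
    return joint / joint.sum()    # condition on the observed Z_t

# Two joint (x, s) points, two possible common observations
# (hypothetical numbers; each K[j] sums to 1 over (k, z)).
K = np.array([[[0.4, 0.1], [0.3, 0.2]],
              [[0.2, 0.2], [0.1, 0.5]]])
pi_next = sib_update(np.array([0.6, 0.4]), 1, K)
```

Because $K$ is built from $\sigma_{t-1}$, the update rule changes with the strategies, which is why $\psi$ carries the superscript $\sigma_{t-1}$ in (11).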
E. Special Cases

We consider the special cases (1)-(3) of the general model we presented in Section II, and identify the sufficient private information $S_{1:T}^{\mathcal{N}}$; we discuss the application of the sufficient information approach to special case (4) in Section VII.

1) Real-time source coding-decoding: The encoder's and decoder's private information are given by $P_t^1 = \{X_{1:t}\}$ and $P_t^2 = \{\hat{X}_{1:t-\delta}\}$, respectively. The agents' common information is given by $C_t = \{M_{1:t-1}\}$. We can verify that $S_t^1 = \tilde{X}_t = \{X_{t-\max(k,\delta+1)+1}, \ldots, X_t\}$ and $S_t^2 = \emptyset$ satisfy the conditions of Definition 2; this is similar to the structural results in [27, Sections III and VI]. Consequently, the common information based belief is $\Pi_t = \mathbb{P}^g\{X_{t-\max(k,\delta+1)+1:t} \,|\, M_{1:t-1}\}$.

2) Delayed sharing information structure: We have $P_t^i = \{Y_{t-d+1:t}^i, A_{t-d+1:t-1}^i\}$ and $C_t = \{Y_{1:t-d}, A_{1:t-d}\}$. Since we do not assume any specific structure for the system dynamics and the agents' observations, agent $i$'s complete private information $P_t^i$ is payoff-relevant for him. Therefore, we set $S_t^i = P_t^i$. Consequently, we have $\Pi_t = \mathbb{P}^g\{X_t, Y_{t-d+1:t}, A_{t-d+1:t-1} \,|\, Y_{1:t-d}, A_{1:t-d}\}$. The above sufficient information appears in the first structural result in [18].

3) Real-time multi-terminal communication: We have $P_t^1 = \{X_{1:t}^1, M_{1:t-1}^1\}$, $P_t^2 = \{X_{1:t}^2, M_{1:t-1}^2\}$, $P_t^3 = \{Y_{1:t}^1, Y_{1:t}^2, \hat{X}_{1:t-1}\}$, and $C_t = \emptyset$. It is easy to verify that $S_t^1 = (X_t^1, \mathbb{P}\{R \,|\, X_{1:t}^1\}, \mathbb{P}\{Y_{1:t-1}^1 \,|\, M_{1:t-1}^1\})$, $S_t^2 = (X_t^2, \mathbb{P}\{R \,|\, X_{1:t}^2\}, \mathbb{P}\{Y_{1:t-1}^2 \,|\, M_{1:t-1}^2\})$, and $S_t^3 = P_t^3$ satisfy the conditions of Definition 2; this sufficient information corresponds to the structural results that appear in [29].

V. MAIN RESULTS

In this section, we present our main results for the analysis of dynamic decision problems with asymmetric information and non-strategic agents using the notion of sufficient information. We first provide a generalization of the policy-independence property of beliefs to decision problems with many agents (Theorem 1).
Second, we show that the set of SIB strategies is rich enough so that the restriction to them is without loss of generality (Theorem 2). That is, given any strategy profile $g$, there exists a SIB strategy profile $\sigma$ such that every agent gets the same flow of utility over time under $\sigma$ as under $g$. Third, we consider dynamic team problems with asymmetric information. We show that using SIB strategies, we can decompose the problem sequentially over time, formulate a dynamic program, and determine a globally optimal policy via backward induction (Theorem 3).

Theorem 1 (Policy-independence belief property).

(i) Consider a general strategy profile $g$. If agents $-i$ play according to strategies $g^{-i}$, then for every strategy $g^i$ that agent $i$ plays,

$\mathbb{P}^{g}\{x_t, p_t^{-i} \,|\, h_t^i\} = \mathbb{P}^{g^{-i}}\{x_t, p_t^{-i} \,|\, h_t^i\}.$ (12)

(ii) Consider a SIB strategy profile $\sigma$ along with the associated update rule $\psi$. If agents $-i$ play according to SIB strategies $\sigma^{-i}$, then for every general strategy $g^i$ that agent $i$ plays,

$\mathbb{P}_{\psi}^{\sigma^{-i}, g^i}\{x_t, p_t^{-i} \,|\, h_t^i\} = \mathbb{P}_{\psi}^{\sigma^{-i}}\{x_t, p_t^{-i} \,|\, h_t^i\}.$ (13)

Theorem 1 provides a generalization of the policy-independence belief property for centralized stochastic control problems [3] to multi-agent decision making problems. Part (i) of Theorem 1 states that, under perfect recall, agent $i$'s belief is independent of his actual strategy $g^i$. Part (ii) of Theorem 1 refers to the case where agents $-i$ play SIB strategies $\sigma^{-i}$ and update their SIB belief according to the SIB update rule $\psi$. The update rule $\psi$ is determined based on $(\sigma^{-i}, \sigma^i)$ via Bayes' rule, where $\sigma^i$ denotes the SIB strategy that agents $-i$ assume agent $i$ utilizes.
Equation (13) states that even if agent $i$ unilaterally and privately deviates from his SIB strategy, his belief is independent of his actual strategy $g^i$, and depends only on the other agents' strategies $\sigma^{-i}$ as well as the other agents' assumption about the SIB strategy $\sigma^i$ (or, equivalently, the SIB update rule $\psi$).

In POMDPs it is shown that the restriction to Markov strategies is without loss of optimality. We provide a generalization of this result to decision problems with many agents. We show that the restriction to SIB strategies is without loss of generality in non-strategic settings given that the agents have access to a public randomization device. We say that the agents have access to a public randomization device if at every time $t \in \mathcal{T}$ they observe a public random signal $\omega_t$ that is completely independent of all events and primitive random variables in the decision problem, is uniformly distributed on $[0, 1]$, and is independent across time. As a result, in general, at every $t \in \mathcal{T}$, all agents can condition their actions on the realization of $\omega_t$ as well as their own information. In other words, a public randomization device enables the agents to play correlated randomized strategies. We denote by $\sigma_t^i(\Pi_t, S_t^i, \omega_t)$ agent $i$'s SIB strategy using the public randomization device, for every $i \in \mathcal{N}$ and $t \in \mathcal{T}$.

Theorem 2. Assume that the non-strategic agents have access to a public randomization device. Then, for any strategy profile $g$ there exists an equivalent SIB strategy profile $\sigma$ that results in the same expected flow of utility, i.e.,

$\mathbb{E}^{g}\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, g_\tau^{\mathcal{N}}(H_\tau^{\mathcal{N}}))\Big\} = \mathbb{E}^{\sigma}\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, \sigma_\tau^{\mathcal{N}}(\Pi_\tau, S_\tau^{\mathcal{N}}, \omega_\tau))\Big\},$ (14)

for all $i \in \mathcal{N}$ and $t \in \mathcal{T}$. We provide an intuitive explanation for the result of Theorem 2 below.
For every agent $i \in \mathcal{N}$, his complete information history $H_t^i$ at any time $t \in \mathcal{T}$ consists of two components: (i) one component captures his information about past events that is relevant to the continuation decision problem; (ii) the other component, given the first, captures the information about past events that is irrelevant to the continuation decision problem. We show that the combination of sufficient private information $S_t^i$ and sufficient common information $\Pi_t$ contains the first component. Nevertheless, in general, the agents can coordinate their actions by incorporating the second component into their decisions, since their information about past events is correlated. Let $R_t^i$ denote the part of agent $i$'s information $H_t^i$ that is not captured by $(\Pi_t, S_t^i)$. We show that the variables $\{R_t^1, \ldots, R_t^N\}$ are jointly independent of $\{(\Pi_t, S_t^1), \ldots, (\Pi_t, S_t^N)\}$ (Lemma 2 in the Appendix). (We note that the result of Theorem 1 provides a property crucial for the analysis of decision problems with strategic agents, because it ensures that an agent's unilateral deviation does not influence his belief; see the companion paper [2] for more details.) Therefore, at every time $t \in \mathcal{T}$, we can generate a set of signals $\{\tilde{R}_t^1, \ldots, \tilde{R}_t^N\}$, one for each agent, using the public randomization device $\omega$, so that they are identically distributed to $\{R_t^1, \ldots, R_t^N\}$. Using the signals $\{\tilde{R}_t^1, \ldots, \tilde{R}_t^N\}$ along with the information state $(\Pi_t, S_t^i)$ for every agent $i \in \mathcal{N}$, we can thus recreate a (simulated) history that is identically distributed to $H_t^i$.
This implies that, given a public randomization device $\omega$, it is sufficient for each agent $i \in \mathcal{N}$ to keep track only of $(\Pi_t, S_t^i)$ instead of his complete history $H_t^i$, and to play a SIB strategy $\sigma^i$ to achieve an identical (in distribution) sequence of outcomes per stage as those under the strategy profile $g$.

The result of Theorem 2 states that the class of SIB strategies characterizes a set of simpler strategies where the agents keep track only of a compressed version of their information rather than their entire information history. Moreover, the restriction to the class of SIB strategies is without loss of generality. Thus, along with the results appearing in the companion paper [2], the result of Theorem 2 suggests that the sufficient information approach proposed in this paper presents a unified methodology for the study of decision problems with many non-strategic or strategic agents and asymmetric information.

We would like to discuss the implication of Theorem 2 for two special instances of our model. First, when $N = 1$, there is no need for a public randomization device since the single decision maker does not need to correlate the outcome of his randomized strategy with any other agent. Therefore, the result of Theorem 2 states that the restriction to Markov strategies in POMDPs is without loss of generality. Second, when $N > 1$ and the agents have identical utilities, i.e., dynamic teams, utilizing a public randomization device does not improve the performance. This is because, in dynamic teams, a randomized strategy profile is optimal if and only if it is optimal for every realization of the randomization. Therefore, the restriction to SIB strategies in dynamic teams is without loss of optimality.

Using the result of Theorem 2, we present below a sequential decomposition of dynamic teams over time. We formulate a dynamic program that enables us to determine a globally optimal strategy profile via backward induction.

Theorem 3.
A SIB strategy profile $\sigma$ is a globally optimal solution to a dynamic team problem with asymmetric information if it solves the following dynamic program:

$V_{T+1}(\pi_{T+1}) := 0, \quad \forall \pi_{T+1};$ (15)

at every $t \in \mathcal{T}$, and for every $\pi_t$,

$\sigma_t^{\mathcal{N}}(\pi_t, \cdot) \in \arg\max_{\alpha^{\mathcal{N}} : \mathcal{S}_t^{\mathcal{N}} \to \Delta(\mathcal{A}_t^{\mathcal{N}})} \mathbb{E}^{\pi_t}\big\{u_t^{team}(X_t, \alpha^{\mathcal{N}}(S_t^{\mathcal{N}})) + V_{t+1}(\psi_t(\pi_t, \alpha^{\mathcal{N}}, Z_{t+1}))\big\},$ (16)

$V_t(\pi_t) := \max_{\alpha^{\mathcal{N}} : \mathcal{S}_t^{\mathcal{N}} \to \Delta(\mathcal{A}_t^{\mathcal{N}})} \mathbb{E}^{\pi_t}\big\{u_t^{team}(X_t, \alpha^{\mathcal{N}}(S_t^{\mathcal{N}})) + V_{t+1}(\psi_t(\pi_t, \alpha^{\mathcal{N}}, Z_{t+1}))\big\}.$ (17)

The results of Theorems 2 and 3 extend the results of [18], [26] for the study of dynamic teams in two directions. First, they state that the restriction to the set of SIB strategies is without loss of generality, while the results of [18], [26] only state that this restriction is without loss of optimality. Second, the definition of Common Information Based (CIB) strategies, first presented in [18], [26], requires the agents to use all of their private information $P_t^i$, $i \in \mathcal{N}$ (or all of their private memory, which is a predetermined function of their private information, if they do not have perfect recall); the result of Theorem 3 holds for SIB strategies where the agents' private information is effectively compressed, and thus it generalizes the definition of CIB strategies proposed in [18], [26].

VI. DISCUSSION

A. Constructive algorithm

The sufficient information approach described in Sections IV and V presents a generalization of the notion of information state to dynamic multi-agent decision problems with non-classical information structure. Nevertheless, we would like to point out that our approach does not address all the issues present in the study of dynamic multi-agent decision problems. We discuss the main limitation of our approach below.

In POMDPs, an information state with a time-invariant domain can be determined by forming the probability distribution over the system state conditioned on the current information.
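The dynamic program (15)-(17) has the standard backward-induction skeleton. A sketch with a plain finite state standing in for the SIB belief $\pi_t$ and ordinary actions standing in for the prescriptions $\alpha^{\mathcal{N}}$ (all rewards and kernels hypothetical):

```python
import numpy as np

def backward_induction(horizon, n_states, actions, u, P):
    """Backward induction: start from V_{T+1} = 0 and compute V_t
    from V_{t+1} as in (15)-(17), with a finite state in place of pi_t.

    u[t][s, a]:  stage reward
    P[t][a]:     transition matrix, row s gives P(s' | s, a)
    """
    V = np.zeros(n_states)                      # V_{T+1} := 0, as in (15)
    policy = []
    for t in reversed(range(horizon)):
        Q = np.stack([u[t][:, a] + P[t][a] @ V for a in actions], axis=1)
        policy.append(Q.argmax(axis=1))         # maximizer, as in (16)
        V = Q.max(axis=1)                       # value update, as in (17)
    policy.reverse()
    return V, policy

u = [np.array([[1.0, 0.0], [0.0, 2.0]])] * 2    # same reward both stages
P = [{0: np.eye(2), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}] * 2
V, pol = backward_induction(2, 2, [0, 1], u, P)
```

In the actual team problem each "action" is a prescription, i.e. a function from $\mathcal{S}_t^{\mathcal{N}}$ to actions, so the maximization in (16)-(17) is over a finite but larger set; the backward recursion itself is unchanged.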
Our approach does not offer an explicit constructive algorithm that determines a mutually-consistent set of information states, one for each agent, with time-invariant domains in dynamic multi-agent decision problems. Specifically, Definition 2 describes only a set of sufficient conditions that one can use to evaluate whether a specific compression of the agents' private information is sufficient for decision making purposes; it does not offer a constructive algorithm to determine a compression of the agents' private information that leads to an information state with a time-invariant domain. Given a set of sufficient private information with time-invariant domains for the agents, we achieve, through the formation of SIB beliefs, a compression of the agents' common information that results in a set of information states with time-invariant domains. In Sections II and IV, we presented instances of multi-agent decision problems where we can discover a set of information states with time-invariant domains. Nonetheless, it is not clear whether such a set of mutually-consistent information states with time-invariant domains exists for every dynamic multi-agent decision problem. Therefore, an interesting, but challenging, future direction would be to identify classes of dynamic decision problems with non-classical information structure where we can guarantee the existence of a set of mutually-consistent information states with time-invariant domains, and prescribe a constructive methodology for their identification. Moreover, we would like to point out that the sufficient information approach presented here provides sufficient conditions that can be used to evaluate an educated guess one may have for specific multi-agent problems.

B. Comparison with other Approaches

The sufficient information approach proposed in this paper shares similarities and also has differences with existing conceptual approaches to the study of dynamic multi-agent decision problems.
Below, we briefly discuss these approaches and compare them with the sufficient information approach.

1) Comparison with the Agent-by-Agent Approach: The agent-by-agent approach proceeds as follows: start with an initial guess of a strategy profile $g$ for all agents. At each iteration, select one agent, say agent $i$, and update his strategy to a best-response strategy given the strategies $g^{-i}$ of all other agents. Repeat the process until a fixed point is reached, that is, when no agent can improve performance by unilaterally changing his strategy.

If the above-described iterative process converges, the resulting strategy profile is an agent-by-agent optimal strategy profile; however, such an agent-by-agent optimal strategy profile is, in general, not a globally optimal strategy profile [24]. This is because multi-agent decision problems are, in general, not convex in the agents' strategies [5]. Therefore, the above-described iterative process does not necessarily converge, or it may converge to a locally optimal strategy profile that is not a globally optimal one. In contrast to the agent-by-agent approach, the sufficient information approach determines a globally optimal strategy profile for multi-agent decision problems with non-strategic agents.

The agent-by-agent approach can be used to discover qualitative properties of optimal strategies. Specifically, we fix the strategies of all agents except one, say agent $i$, to an arbitrary set of strategies $g^{-i}$, and solve for agent $i$'s best response; to determine agent $i$'s best response we need to solve a POMDP, where the system state and system dynamics, in general, depend on $g^{-i}$. If agent $i$'s best response possesses a property that holds for every choice of $g^{-i}$, then a globally optimal strategy for agent $i$ possesses the same property.
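The gap between agent-by-agent and global optimality can be seen on a two-agent static team with binary actions and a hypothetical utility matrix: unilateral best responses can stall at a fixed point that is not globally optimal.

```python
import numpy as np

# Team utility U[a1, a2]: the profile (0, 0) is globally optimal, but
# (1, 1) is also a fixed point of unilateral best responses
# (hypothetical numbers).
U = np.array([[2.0, 0.0],
              [0.0, 1.0]])

def agent_by_agent(a1, a2, iters=10):
    """Iterate unilateral best responses until a fixed point."""
    for _ in range(iters):
        a1 = int(np.argmax(U[:, a2]))    # agent 1 best-responds to a2
        a2 = int(np.argmax(U[a1, :]))    # agent 2 best-responds to a1
    return a1, a2
```

Starting from (0, 0) the iteration stays at the global optimum, while starting from (1, 1) it stays at the strictly worse profile (1, 1): both profiles are agent-by-agent optimal, but only the first is globally optimal.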
In contrast to the agent-by-agent approach, where one needs to solve a POMDP parameterized by $g^{-i}$, to discover qualitative properties of a globally optimal strategy profile using the sufficient information approach we only need to check the set of conditions appearing in Definition 2 (or, equivalently, the more general Definition 3 that will appear in Section VII). Moreover, using the sufficient information approach we can discover qualitative properties of optimal strategies that cannot be discovered by the agent-by-agent approach. For instance, consider the following example.

Example. Consider a team problem with two agents and observable actions, where agent 1's action $A_t^1$ does not affect the evolution of $X_t$ for all $t$, i.e., $X_{t+1} = f_t(X_t, A_t^2, W_t)$. Each agent $i$, $i = 1, 2$, has an imperfect private observation of the state $X_t$ at $t$ given by $Y_t^i = O_t^i(X_t, W_t^i)$. An arbitrary choice of strategy $g_t^i$ for agent $i$ at $t$ depends, in general, on his complete information history given by $\{Y_{1:t}^i, A_{1:t-1}\}$. Therefore, following the agent-by-agent approach, if agent $i$'s strategy depends on $A_\tau^1$ for some $\tau$, $1 \le \tau \le t-1$, then agent $j$'s, $j \neq i$, best response also depends on $A_\tau^1$. Consequently, the agent-by-agent approach fails to characterize $A_{1:t-1}^1$ as irrelevant information for decision making purposes for agents 1 and 2. However, using the sufficient information approach we can simply show that a globally optimal strategy profile depends only on $\mathbb{P}\{X_t \,|\, Y_{1:t}^i, A_{1:t-1}^2\}$ for agent $i$.

2) Comparison with the Designer's Approach: The designer's approach was originally proposed by Witsenhausen in [25], and was further investigated in [35].
This approach considers the decision problem from the point of view of a designer (she) who knows the system model and the probability distribution of the primitive random variables, and chooses control/decision strategies for all agents; she chooses these strategies without having any observation/knowledge of the realizations of the primitive random variables (i.e., she chooses these strategies before the system evolution starts). Therefore, the designer effectively solves a centralized planning problem. The designer's approach proceeds by: (i) formulating the centralized planning problem as a multi-stage, open-loop stochastic control problem in which the designer's decision at each time is a set of control strategies for all agents; (ii) using the standard techniques of centralized stochastic control to obtain a dynamic programming decomposition of the decision problem. Each step of the resulting dynamic program is a functional optimization problem.

The designer's approach breaks the interdependencies between the agents' decisions and information over time by transferring all the complexity that arises due to the non-classical information structure and signaling to a larger information state, which at each time is given by a probability distribution on $\mathcal{H}_t$, the domain of which increases with time as agents have perfect recall. Therefore, the sequential decomposition resulting from the designer's approach is not, in general, very practical for the study of multi-agent dynamic decision problems with asymmetric information. In contrast to the designer's approach, the sufficient information approach provides a sequential decomposition of the decision problem over time where at each time $t$ each agent makes decisions based on only a compression of his information $H_t^i$.
Therefore, it leads to a dynamic program where the state variable at each step of the program is a probability distribution on $\mathcal{S}_t$ instead of a probability distribution on $\mathcal{H}_t$ as in the designer's approach.

3) Comparison with the Common Information Approach: The common information approach, proposed in [18], [26], addresses some of the drawbacks of the designer's approach by modeling the decision problem as a closed-loop centralized planning problem (POMDP) in which a coordinator observes perfectly the common information $C_t$ at each time $t$ and, based on this knowledge, chooses a set of partial control strategies/prescriptions that determine how each agent takes an action based on his private information at time $t$. The coordinator's information state at time $t$ is his belief about $(X_t, P_t)$ conditioned on $C_t$. As shown in [26], the dynamic programming decomposition achieved by the common information approach is simpler than that achieved by the designer's approach. In the common information approach the agents' private information remains intact. Therefore, the resulting decomposition is not very practical whenever the agents' private information grows with time (see special cases 1, 3, and 4 in Section II). Furthermore, the common information approach becomes identical to the designer's approach whenever the agents do not share any common information over time (see special case 3).

In the sufficient information approach, we provide conditions sufficient to identify mutually-consistent compressions of the agents' private information that are sufficient for decision making purposes and do not result in any loss in system performance. Thus, the sufficient information approach gives rise to a dynamic program that is simpler than the one resulting from the common information approach.
As we show in Section VII, these conditions are the core of the sufficient information approach; they are generalized by Definition 3 to capture a mutually-consistent joint compression of the agents' private and common information. (We note that an instance where the domain of the control law is time-invariant is presented in [35].) Moreover, in the model of Section II, we do not assume that the agents share a common objective. Therefore, we do not reformulate the original multi-agent decision problem as a centralized planning problem from the coordinator's point of view when signaling occurs. Instead, we provide conditions sufficient to identify compressions of the agents' information in a mutually-consistent manner at the individual level. As a result, our approach is applicable to both strategic and non-strategic settings (see our companion paper [2] for strategic settings).

VII. GENERALIZATION

In the sufficient information approach presented in Section IV, we treat the agents' private information and common information separately. This is because the main challenge in the study of dynamic decision problems with non-strategic agents is due to the presence of the agents' private information. Nevertheless, such a separate treatment of private and common information is not necessary. Using the same rationale that leads to Definition 2, we present below a set of conditions sufficient to characterize a mutually consistent compression of the agents' information, without separating private and common components, that is sufficient for decision making purposes.

Definition 3 (Sufficient information).
We say $L_t^i = \tilde{\zeta}_t^i(P_t^i, C_t; g_{1:t-1}) \in \mathcal{L}_t^i$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, is sufficient information for the agents if:

(i) it can be updated recursively as

$L_t^i = \tilde{\phi}_t^i(L_{t-1}^i, H_t^i \setminus H_{t-1}^i; g_{1:t-1})$ for $t \in \mathcal{T} \setminus \{1\}$; (18)

(ii) for any strategy profile $g$ and for all realizations $\{c_t, p_t, p_{t+1}, z_{t+1}, a_t\} \in \mathcal{C}_t \times \mathcal{P}_t \times \mathcal{P}_{t+1} \times \mathcal{Z}_{t+1} \times \mathcal{A}_t$ with positive probability,

$\mathbb{P}^{g_{1:t}}\{l_{t+1} \,|\, p_t, c_t, a_t\} = \mathbb{P}^{g_{1:t}}\{l_{t+1} \,|\, l_t, a_t\},$ (19)

where $l_\tau^{\mathcal{N}} = \tilde{\zeta}_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; g_{1:\tau-1})$ for $\tau \in \mathcal{T}$;

(iii) for every strategy profile $\tilde{g}$ of the form $\tilde{g} := \{\tilde{g}_t^i : \mathcal{L}_t^i \to \Delta(\mathcal{A}_t^i),\, i \in \mathcal{N},\, t \in \mathcal{T}\}$ and $a_t \in \mathcal{A}_t$, $t \in \mathcal{T}$,

$\mathbb{E}^{\tilde{g}_{1:t-1}}\{u_t^i(X_t, A_t) \,|\, c_t, p_t^i, a_t\} = \mathbb{E}^{\tilde{g}_{1:t-1}}\{u_t^i(X_t, A_t) \,|\, l_t^i, a_t\},$ (20)

for all realizations $\{c_t, p_t^i\} \in \mathcal{C}_t \times \mathcal{P}_t^i$ of positive probability, where $l_\tau^{\mathcal{N}} = \tilde{\zeta}_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; \tilde{g}_{1:\tau-1})$ for $\tau \in \mathcal{T}$;

(iv) given an arbitrary strategy profile $\tilde{g}$ of the form $\tilde{g} := \{\tilde{g}_t^i : \mathcal{L}_t^i \to \Delta(\mathcal{A}_t^i),\, i \in \mathcal{N},\, t \in \mathcal{T}\}$, $i \in \mathcal{N}$, and $t \in \mathcal{T}$,

$\mathbb{P}^{\tilde{g}_{1:t-1}}\{l_t^{-i} \,|\, p_t^i, c_t\} = \mathbb{P}^{\tilde{g}_{1:t-1}}\{l_t^{-i} \,|\, l_t^i\},$ (21)

for all realizations $\{c_t, p_t^i\} \in \mathcal{C}_t \times \mathcal{P}_t^i$ with positive probability, where $l_\tau^{\mathcal{N}} = \tilde{\zeta}_\tau^{\mathcal{N}}(p_\tau^{\mathcal{N}}, c_\tau; \tilde{g}_{1:\tau-1})$ for $\tau \in \mathcal{T}$.

The conditions of Definition 3 are similar to those of Definition 2, but they concern the agents' private and common information rather than just their private information. Throughout the paper, we do not assume that the agents' private observations are necessarily disjoint. Therefore, one can define $P_t^i = H_t^i$ and $C_t = \emptyset$ for all $i \in \mathcal{N}$ and $t \in \mathcal{T}$, in which case Definition 3 would be the same as Definition 2.
Consequently, all the results appearing in this paper (Theorems 1-6) also hold for sufficient information characterized by Definition 3.

We show below that the set of information states $\{(S_t^i, \Pi_t), i \in \mathcal{N}\}$ proposed in Section IV satisfies the conditions of Definition 3. Therefore, Definition 3 provides a generalization of the sufficient information approach presented in Section IV, as it does not require compressing the agents' private and common information separately.

Theorem 4. The set of information states $L_t^i := (S_t^i, \Pi_t)$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, satisfies Definition 3.

Compared to Definition 2, Definition 3 provides conditions sufficient for a mutually consistent joint compression of the agents' private and common information. However, similar to the discussion in Section VI-A, it does not provide a constructive algorithm to determine a set of sufficient information $L_t^i$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, with a time-invariant domain.

Remark 1. In view of Definition 3, one can replace condition (ii) of Definition 2 with a weaker one that requires that $S_t$ include all the information necessary to form a belief about the realizations (of parts) of $Z_{t+1}$ only if (those parts of) $Z_{t+1}$ affect the realization of $\Pi_{t+1}$ given $\Pi_t$.

Using Definition 3, we identify a set of sufficient information for Special Case 4 described in Section II.

Special Case 4 (Optimal remote and local controllers): We have $C_t = \{Y^t\}$, $P_t^1 = \{X^t, A^{t-1}\} \setminus C_t$, and $P_t^2 = \{A^{t-1}\}$. Let $\hat{\tau} \le t$ denote the last time the data transmission between the local and remote controllers was successful. We can restrict attention, without loss of optimality, to the class of pure strategies for both controllers. Therefore, one can show that $L_t^1 = \{X_t, \{\mathbb{P}^g\{X_t = x_t \mid X_{\hat{\tau}}\}, \forall x_t \in \mathcal{X}_t\}\}$ and $L_t^2 = \{\mathbb{P}^g\{X_t = x_t \mid X_{\hat{\tau}}\}, \forall x_t \in \mathcal{X}_t\}$ satisfy the conditions of Definition 3; this is similar to the structural results in [30], [31].
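The structure of $L_t^2$ above can be illustrated concretely: between successful transmissions, the remote controller's information state is the conditional distribution of $X_t$ given the last received state, obtained by propagating a point mass through the transition kernel. The sketch below is only an illustration of that belief propagation; it assumes an uncontrolled finite Markov chain (the model in the paper has controlled dynamics) and made-up transition probabilities.

```python
import numpy as np

def remote_belief(P, x_last, k):
    """Belief P{X_t = x | X_tau_hat = x_last} after k steps without a
    successful transmission: the x_last-th row of the k-step transition
    matrix, computed by repeated multiplication."""
    belief = np.zeros(P.shape[0])
    belief[x_last] = 1.0          # point mass at the last received state
    for _ in range(k):
        belief = belief @ P       # one-step belief propagation
    return belief

# Hypothetical two-state chain; the last successful transmission delivered
# state 0, and 3 time steps have elapsed since.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
b = remote_belief(P, x_last=0, k=3)
```

Under pure strategies, both controllers can compute this belief from the commonly known transmission outcomes, which is why it serves as mutually consistent sufficient information.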
VIII. EXTENSION TO INFINITE HORIZON

In the model of Section II, we assume that the horizon $T$ is finite. We now present a model similar to that of Section II with infinite horizon, i.e. $T = \infty$, and provide the extensions of our results to dynamic decision problems with infinite horizon.

Infinite Horizon Dynamic Decision Problem: There are $N$ non-strategic agents who live in a dynamic Markovian world over an infinite horizon. Consider a time-invariant model where the system state, action, and observation spaces are finite and time-invariant, i.e. $\mathcal{X}_\infty = \mathcal{X}_t$, $\mathcal{A}_\infty = \mathcal{A}_t$, $\mathcal{Z}_\infty = \mathcal{Z}_t$, and $\mathcal{Y}_\infty = \mathcal{Y}_t$ for all $t \in \mathbb{N}$. Let $X_t \in \mathcal{X}_\infty$ denote the system state at $t \in \mathbb{N}$. Given the agents' actions $A_t$ at $t$, the system state evolves as
$$X_{t+1} = f_\infty(X_t, A_t, W_t^x), \quad (22)$$
where $\{W_t^x, t \in \mathbb{N}\}$ is a sequence of independent and identically distributed random variables. The initial state $X_1$ is a random variable with probability distribution $\eta \in \Delta(\mathcal{X}_\infty)$ with full support that is common knowledge among the agents.

At every time $t \in \mathbb{N}$, each agent $i \in \mathcal{N}$ receives a noisy observation $Y_t^i$ given by
$$Y_t^i = O_\infty^i(X_t, A_{t-1}, W_t^i), \quad (23)$$
where $\{W_t^i, t \in \mathbb{N}, i \in \mathcal{N}\}$ is a sequence of independent and identically distributed random variables.

In addition, at every $t \in \mathbb{N}$ all agents receive a common observation $Z_t \in \mathcal{Z}_\infty$ given by
$$Z_t = O_\infty^c(X_t, A_{t-1}, W_t^c), \quad (24)$$
where $\{W_t^c, t \in \mathbb{N}\}$ is a sequence of independent and identically distributed random variables; the sequences $\{W_t^x, t \in \mathbb{N}\}$, $\{W_t^c, t \in \mathbb{N}\}$, and $\{W_t^i, t \in \mathbb{N}, i \in \mathcal{N}\}$ and the initial state $X_1$ are mutually independent.

Similar to the model of Section II, let $P_t^i$ and $C_t$ denote agent $i$'s, $i \in \mathcal{N}$, private and common information at $t \in \mathbb{N}$, respectively.
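A minimal simulation sketch of the time-invariant dynamics (22)-(24). The functions below and all numerical choices are hypothetical stand-ins for the model primitives $f_\infty$, $O_\infty^i$, $O_\infty^c$, which are data of the problem, and the random policy stands in for an arbitrary strategy profile.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 3, 2

def f_inf(x, a, w):              # state transition, as in (22)
    return (x + a + w) % N_STATES

def obs_private(x, a_prev, w):   # private observation, as in (23)
    return (x + w) % N_STATES

def obs_common(x, a_prev, w):    # common observation, as in (24)
    return (x + a_prev + w) % N_STATES

x = int(rng.integers(N_STATES))  # X_1 ~ eta (uniform here, full support)
a_prev = 0
traj = []
for t in range(5):
    y = obs_private(x, a_prev, int(rng.integers(2)))
    z = obs_common(x, a_prev, int(rng.integers(2)))
    a = int(rng.integers(N_ACTIONS))     # a placeholder (random) policy
    traj.append((x, y, z, a))
    x = f_inf(x, a, int(rng.integers(2)))
    a_prev = a
```

The independent noise sequences $W_t^x$, $W_t^i$, $W_t^c$ are drawn separately at each step, matching the mutual-independence assumption of the model.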
Agent $i$ has a time-invariant instantaneous utility $\delta^{t-1} u_\infty^i(X_t, A_t)$, and his total discounted utility is given by
$$U_\infty^i(X, A) := \sum_{t=1}^{\infty} \delta^{t-1} u_\infty^i(X_t, A_t), \quad (25)$$
where $\delta$ denotes the discount factor.

We provide an extension of our results to infinite horizon dynamic decision problems with non-strategic agents. To that end, we first present a generalization of the definition of sufficient private information to infinite horizon decision problems.

Definition 4 (Time-invariant sufficient private information). We say $S_t^i$, $i \in \mathcal{N}$, $t \in \mathbb{N}$, is time-invariant sufficient private information if it is sufficient private information and has a time-invariant domain, denoted by $\mathcal{S}_\infty^i$, $i \in \mathcal{N}$.

We note that for the special cases presented in Section IV, the characterized sufficient private information is time-invariant.

Following an argument similar to the one presented in Section V, we extend the result of Theorem 2 to infinite horizon dynamic decision problems with non-strategic agents.

Theorem 5. Consider an infinite horizon dynamic decision problem with non-strategic agents having access to a public randomization device. Then, for any arbitrary strategy profile $g$ there exists an equivalent stationary SIB strategy profile $\sigma_\infty$ that results in the same expected flow of utility, i.e.,
$$\mathbb{E}^g\Big\{\sum_{\tau=t}^{\infty} \delta^{\tau-1} u_\infty^i(g_\tau^{\mathcal{N}}(H_\tau^{\mathcal{N}}), X_\tau)\Big\} = \mathbb{E}^{\sigma_\infty}\Big\{\sum_{\tau=t}^{\infty} \delta^{\tau-1} u_\infty^i(\sigma_\tau^{\mathcal{N}}(\Pi_\tau, S_\tau^{\mathcal{N}}, \omega_\tau), X_\tau)\Big\}, \quad (26)$$
for all $i \in \mathcal{N}$ and $t \in \mathbb{N}$.

Next, we consider the case where the agents share the same objective, $u_\infty^i(\cdot,\cdot) = u_\infty^{team}(\cdot,\cdot)$ for all $i \in \mathcal{N}$, i.e. an infinite horizon dynamic team problem.
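In the degenerate case with no private information and a common observation that perfectly reveals $X_t$, the SIB belief is a point mass and the infinite-horizon team problem just introduced reduces to a standard discounted MDP, whose Bellman equation can be solved by value iteration. A minimal sketch of that degenerate case (all transition probabilities and rewards are made-up illustrations, not from the paper):

```python
import numpy as np

# Degenerate illustration: Z_t reveals X_t, so the stationary dynamic
# program reduces to V(x) = max_a E{ u(x,a) + delta * V(X_{t+1}) }.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[a][x][x']: transition probs
              [[0.5, 0.5], [0.1, 0.9]]])
u = np.array([[1.0, 0.0],                  # u[a][x]: team reward
              [0.0, 2.0]])
delta = 0.9                                # discount factor

V = np.zeros(2)
for _ in range(1000):                      # value iteration to a fixed point
    Q = u + delta * (P @ V)                # Q[a][x] = u(x,a) + delta*E V(x')
    V_new = Q.max(axis=0)                  # maximize over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

In the general case of Theorem 6 below, the maximization is over prescriptions $\alpha^{\mathcal{N}} : \mathcal{S}_\infty^{\mathcal{N}} \to \mathcal{A}_\infty^{\mathcal{N}}$ rather than over actions, and the state of the dynamic program is the SIB belief rather than $X_t$.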
It is shown in [3] that in infinite horizon POMDPs we can restrict attention, without loss of generality, to stationary Markov policies. We provide a generalization of this result to dynamic multi-agent decision problems below.

Given a set of time-invariant sufficient private information, let $\Pi_t \in \Delta(\mathcal{X}_\infty \times \mathcal{S}_\infty)$ denote the SIB belief about $(X_t, S_t)$ at time $t$. We call the mapping $\sigma_\infty^i : \Delta(\mathcal{X}_\infty \times \mathcal{S}_\infty) \times \mathcal{S}_\infty^i \to \Delta(\mathcal{A}_\infty^i)$ a stationary SIB strategy for agent $i$ if $S_t^i$, $i \in \mathcal{N}$, $t \in \mathbb{N}$, is time-invariant sufficient private information. Similarly, given a stationary SIB strategy profile $\sigma_\infty$, we define a stationary SIB update rule as a time-invariant mapping $\eta_\infty^{\sigma_\infty} : \Delta(\mathcal{X}_\infty \times \mathcal{S}_\infty) \times \mathcal{Z}_\infty \to \Delta(\mathcal{X}_\infty \times \mathcal{S}_\infty)$ that recursively determines the SIB belief via Bayes' rule for all $t \in \mathbb{N}$. Similarly, when the agents have access to a public randomization device $\omega_t$ for every $t \in \mathbb{N}$, let $\sigma_\infty^i(\Pi_t, S_t^i, \omega_t)$ denote agent $i$'s stationary SIB strategy using the public randomization device, for every $i \in \mathcal{N}$.

We provide a sequential decomposition similar to that of Theorem 3 for infinite horizon dynamic teams below.

Theorem 6. A stationary SIB strategy profile $\sigma_\infty$ is an optimal solution to an infinite horizon dynamic team problem with asymmetric information if it solves the following Bellman equation:
$$V_\infty(\pi_t) := \max_{\alpha^{\mathcal{N}} : \mathcal{S}_\infty^{\mathcal{N}} \to \mathcal{A}_\infty^{\mathcal{N}}} \mathbb{E}^{\pi_t}\big\{u_\infty^{team}(X_t, \alpha^{\mathcal{N}}(S_t^{\mathcal{N}})) + \delta\, V_\infty(\eta_\infty(\pi_t, \alpha^{\mathcal{N}}, Z_{t+1}))\big\}, \quad (27)$$
for all $\pi_t \in \Delta(\mathcal{X}_\infty \times \mathcal{S}_\infty)$.

The result of Theorem 6 provides a generalization of the Bellman equation for POMDPs (see [3, Ch. 8]) to decision problems with many agents and asymmetric information.

IX. CONCLUSION

We presented a general approach to study a broad class of dynamic multi-agent decision making problems with non-strategic agents.
We proposed the notion of sufficient information, which enables us to compress effectively the agents' (private and common) information in a mutually consistent manner for decision making purposes. We showed that the restriction to the class of SIB strategies is without loss of generality. Accordingly, we provided a sequential decomposition of dynamic decision problems with non-strategic agents, and formulated a dynamic program to determine a globally optimal strategy profile in dynamic teams. The sufficient information approach presented in this paper generalizes a set of existing results in the literature for the study of dynamic multi-agent decision making problems with non-strategic agents. Our results in this paper, along with those appearing in the companion paper [2], provide a unified approach to study dynamic decision problems with non-strategic agents (teams) and strategic agents (games). As a future direction, we will investigate the problem of determining a constructive algorithm that enables us to identify sufficient (private) information in a systematic way.

REFERENCES

[1] H. Tavafoghi, Y. Ouyang, and D. Teneketzis, "A sufficient information approach to decentralized decision making," in Proc. 57th IEEE Conference on Decision and Control (CDC), 2018.
[2] H. Tavafoghi, Y. Ouyang, and D. Teneketzis, "A unified approach to dynamic multi-agent decision problems with asymmetric information - Part II: Strategic agents," working paper, 2018.
[3] P. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice-Hall, Inc., 1986.
[4] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Belmont, MA: Athena Scientific, 1995.
[5] A. Mahajan, N. C. Martins, M. C. Rotkowitz, and S. Yüksel, "Information structures in optimal decentralized control," pp. 1291-1306, 2012.
[6] A. A. Kulkarni and T. P.
Coleman, "An optimizer's approach to stochastic control problems with nonclassical information structures," IEEE Transactions on Automatic Control, vol. 60, no. 4, pp. 937-949, 2015.
[7] L. Lessard and S. Lall, "Convexity of decentralized controller synthesis," IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 3122-3127, 2016.
[8] S. Yüksel and N. Saldi, "Convex analysis in decentralized stochastic control and strategic measures," pp. 6050-6055, 2016.
[9] H. S. Witsenhausen, "A counterexample in stochastic optimum control," SIAM Journal on Control, vol. 6, no. 1, pp. 131-147, 1968.
[10] Y.-C. Ho and K.-C. Chu, "Team decision theory and information structures in optimal control problems - Part I," IEEE Transactions on Automatic Control, vol. 17, no. 1, pp. 15-22, 1972.
[11] A. Lamperski and J. C. Doyle, "On the structure of state-feedback LQG controllers for distributed systems with communication delays," pp. 6901-6906, 2011.
[12] L. Lessard and A. Nayyar, "Structural results and explicit solution for two-player LQG systems on a finite time horizon," pp. 6542-6549, 2013.
[13] P. Shah and P. Parrilo, "$\mathcal{H}_2$-optimal decentralized control over posets: A state-space solution for state-feedback," vol. 58, pp. 3084-3096, Dec. 2013.
[14] A. Nayyar and L. Lessard, "Structural results for partially nested LQG systems over graphs," in American Control Conference (ACC), 2015, pp. 5457-5464, 2015.
[15] L. Lessard and S. Lall, "Optimal control of two-player systems with output feedback," IEEE Transactions on Automatic Control, vol. 60, no. 8, pp. 2129-2144, 2015.
[16] S. Yüksel, "Stochastic nestedness and the belief sharing information pattern," IEEE Transactions on Automatic Control, vol. 54, no. 12, pp. 2773-2786, 2009.
[17] Y. Ouyang, S. M. Asghari, and A. Nayyar, "Stochastic teams with randomized information structures," 2017.
[18] A. Nayyar, A. Mahajan, and D.
Teneketzis, "Optimal control strategies in delayed sharing information structures," IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606-1620, 2011.
[19] H. Witsenhausen, "Separation of estimation and control for discrete time systems," Proceedings of the IEEE, vol. 59, no. 11, pp. 1557-1566, 1971.
[20] P. Varaiya and J. Walrand, "On delayed sharing patterns," IEEE Transactions on Automatic Control, vol. 23, no. 3, pp. 443-445, 1978.
[21] T. Yoshikawa, "Decomposition of dynamic team decision problems," IEEE Transactions on Automatic Control, vol. 23, no. 4, pp. 627-632, 1978.
[22] M. Rotkowitz and S. Lall, "A characterization of convex problems in decentralized control," IEEE Transactions on Automatic Control, vol. 50, no. 12, pp. 1984-1996, 2005.
[23] S. M. Asghari and A. Nayyar, "Dynamic teams and decentralized control problems with substitutable actions," 2016.
[24] Y. Ho, "Team decision theory and information structures," Proceedings of the IEEE, vol. 68, no. 6, pp. 644-654, 1980.
[25] H. S. Witsenhausen, "A standard form for sequential stochastic control," Mathematical Systems Theory, vol. 7, no. 1, pp. 5-11, 1973.
[26] A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644-1658, 2013.
[27] H. Witsenhausen, "On the structure of real-time source coders," The Bell System Technical Journal, vol. 58, no. 6, pp. 1437-1451, 1979.
[28] B. Kurtaran, "Corrections and extensions to 'Decentralized stochastic control with delayed sharing information pattern'," IEEE Transactions on Automatic Control, vol. 24, no. 4, pp. 656-657, 1979.
[29] A. Nayyar and D. Teneketzis, "On the structure of real-time encoding and decoding functions in a multiterminal communication system," IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 6196-6214, 2011.
[30] Y. Ouyang, S. Asghari, and A.
Nayyar, "Optimal local and remote controllers with unreliable communication," pp. 6024-6029, 2016.
[31] S. M. Asghari, Y. Ouyang, and A. Nayyar, "Optimal local and remote controllers with unreliable uplink channels," IEEE Transactions on Automatic Control, forthcoming.
[32] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, "The complexity of decentralized control of Markov decision processes," Mathematics of Operations Research, vol. 27, no. 4, pp. 819-840, 2002.
[33] A. Mahajan and M. Mannan, "Decentralized stochastic control," Annals of Operations Research, vol. 241, no. 1-2, pp. 109-126, 2016.
[34] S. Yüksel and T. Başar, Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Springer Science & Business Media, 2013.
[35] A. Mahajan and D. Teneketzis, "Optimal design of sequential real-time communication systems," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5317-5338, 2009.

APPENDIX

Proof of Theorem 1. We prove the result of part (i) by induction. For $t = 1$ the result holds since the agents have not taken any action yet. Suppose that (12) holds for $t-1$.
Then,
$$\mathbb{P}^g\{x_t, h_t^{-i} \mid h_t^i\} = \sum_{x_{t-1}} \mathbb{P}^g\{x_t, x_{t-1}, h_t^{-i} \mid h_t^i\} = \sum_{x_{t-1}} \mathbb{P}^g\{x_t, x_{t-1}, h_{t-1}^{-i}, a_{t-1}^{-i}, y_t^{-i} \mid h_{t-1}^i, a_{t-1}^i, y_t^i, z_t\}$$
$$= \sum_{x_{t-1}} \mathbb{P}\{y_t^{-i} \mid x_t, a_{t-1}\}\, \mathbb{P}^g\{x_t, x_{t-1}, h_{t-1}^{-i}, a_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i, y_t^i, z_t\}$$
$$= \sum_{x_{t-1}} \Big[\mathbb{P}\{y_t^{-i} \mid x_t, a_{t-1}\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}\}\, \mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i}, a_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i, y_t^i, z_t\}\Big]$$
$$= \sum_{x_{t-1}} \Big[\mathbb{P}\{y_t^{-i} \mid x_t, a_{t-1}\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}\}\, g_{t-1}^{-i}(h_{t-1}^{-i})(a_{t-1}^{-i})\, \mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i, y_t^i, z_t\}\Big]$$
$$= \sum_{x_{t-1}} \Big[\mathbb{P}\{y_t^{-i} \mid x_t, a_{t-1}\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}\}\, g_{t-1}^{-i}(h_{t-1}^{-i})(a_{t-1}^{-i})\, \frac{\mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i}, y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}}{\mathbb{P}^g\{y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}}\Big]. \quad (28)$$

Consider the term $\mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i}, y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}$ in the numerator of the expression above. We have,
$$\mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i}, y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}$$
$$= \sum_{a_{t-1}^{-i}, x_t} \Big[\mathbb{P}\{y_t^i, z_t \mid x_t, a_{t-1}^{-i}, a_{t-1}^i\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}^{-i}, a_{t-1}^i\}\, g_{t-1}^{-i}(h_{t-1}^{-i})(a_{t-1}^{-i})\, \mathbb{P}^g\{x_{t-1}, h_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i\}\Big]$$
$$= \sum_{a_{t-1}^{-i}, x_t} \Big[\mathbb{P}\{y_t^i, z_t \mid x_t, a_{t-1}^{-i}, a_{t-1}^i\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}^{-i}, a_{t-1}^i\}\, g_{t-1}^{-i}(h_{t-1}^{-i})(a_{t-1}^{-i})\, \mathbb{P}^{g^{-i}}\{x_{t-1}, h_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i\}\Big]$$
$$= \mathbb{P}^{g^{-i}}\{x_{t-1}, h_{t-1}^{-i}, y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}, \quad (29)$$
where the second equality follows from the induction hypothesis (12) for $t-1$.
Consequently, we also have
$$\mathbb{P}^g\{y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\} = \sum_{\hat{h}_{t-1}^{-i}, \hat{x}_{t-1}} \mathbb{P}^g\{y_t^i, z_t, \hat{x}_{t-1}, \hat{h}_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i\}$$
$$\overset{\text{by (29)}}{=} \sum_{\hat{h}_{t-1}^{-i}, \hat{x}_{t-1}} \mathbb{P}^{g^{-i}}\{y_t^i, z_t, \hat{x}_{t-1}, \hat{h}_{t-1}^{-i} \mid h_{t-1}^i, a_{t-1}^i\} = \mathbb{P}^{g^{-i}}\{y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}. \quad (30)$$

Substituting (29) and (30) into (28),
$$\mathbb{P}^g\{x_t, h_t^{-i} \mid h_t^i\} = \sum_{x_{t-1}} \Big[\mathbb{P}\{y_t^{-i} \mid x_t, a_{t-1}\}\, \mathbb{P}\{x_t \mid x_{t-1}, a_{t-1}\}\, g_{t-1}^{-i}(h_{t-1}^{-i})(a_{t-1}^{-i})\, \frac{\mathbb{P}^{g^{-i}}\{x_{t-1}, h_{t-1}^{-i}, y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}}{\mathbb{P}^{g^{-i}}\{y_t^i, z_t \mid h_{t-1}^i, a_{t-1}^i\}}\Big] = \mathbb{P}^{g^{-i}}\{x_t, h_t^{-i} \mid h_t^i\},$$
which establishes the induction step for $t$.

Given the result of part (i), the result of part (ii) follows directly from the definition of SIB strategies (10) and the SIB update rule (11).

To provide the proof for Theorem 2, we need the following result.

Lemma 1. Given a SIB strategy profile $\sigma$ and an update rule $\psi$ consistent with $\sigma$,
$$\mathbb{P}^{\sigma\psi}\{s_{t+1}, \pi_{t+1} \mid p_t, c_t, a_t\} = \mathbb{P}^{\sigma\psi}\{s_{t+1}, \pi_{t+1} \mid s_t, \pi_t, a_t\}, \quad (31)$$
for all $s_{t+1}, \pi_{t+1}, s_t, \pi_t, a_t$.

Proof of Lemma 1. Let $g^\sigma$ denote the strategy profile, given by (10), that corresponds to the SIB strategy profile $\sigma$.
We have,
$$\mathbb{P}^{\sigma\psi}\{s_{t+1}, \pi_{t+1} \mid p_t, c_t, a_t\} \overset{\pi_t = \gamma_t(c_t)}{=} \mathbb{P}^{\sigma\psi}\{s_{t+1}, \pi_{t+1} \mid p_t, c_t, a_t, \pi_t\}$$
$$\overset{\text{update rule (11)}}{=} \sum_{z_{t+1}} \Big[\mathbb{P}^{\sigma\psi}\{s_{t+1}, z_{t+1} \mid p_t, c_t, a_t, \pi_t\}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$\overset{\text{by (6)}}{=} \sum_{z_{t+1}} \Big[\mathbb{P}^{\sigma\psi}\{s_{t+1}, z_{t+1} \mid s_t, c_t, a_t, \pi_t\}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$= \sum_{y_{t+1}, x_{t+1}, x_t, z_{t+1}} \Big[\mathbb{P}^{\sigma\psi}\{s_{t+1}, z_{t+1}, y_{t+1}, x_{t+1}, x_t \mid s_t, c_t, a_t, \pi_t\}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$\overset{\text{by (1),(2)}}{=} \sum_{y_{t+1}, x_{t+1}, x_t, z_{t+1}} \Big[\mathbb{P}^{\sigma\psi}\{s_{t+1} \mid s_t, c_t, a_t, \pi_t, z_{t+1}, y_{t+1}, x_{t+1}, x_t\}\, \mathbb{P}\{z_{t+1}, y_{t+1} \mid a_t, x_{t+1}\}\, \mathbb{P}\{x_{t+1} \mid x_t, a_t\}\, \mathbb{P}^{\sigma}\{x_t \mid s_t, c_t, a_t, \pi_t\}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$\overset{\text{by (5)}}{=} \sum_{y_{t+1}, x_{t+1}, x_t, z_{t+1}} \Big[\Big(\prod_j \mathbb{1}\{s_{t+1}^j = \phi_{t+1}^j(s_t^j, \{y_{t+1}^j, z_{t+1}, a_t^j\}; g^\sigma)\}\Big)\, \mathbb{P}\{z_{t+1}, y_{t+1} \mid a_t, x_{t+1}\}\, \mathbb{P}\{x_{t+1} \mid x_t, a_t\}\, \mathbb{P}^{\sigma\psi}\{x_t \mid s_t, c_t\}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$\overset{\text{Bayes' rule}}{=} \sum_{y_{t+1}, z_{t+1}, x_{t+1}, x_t} \Big[\Big(\prod_j \mathbb{1}\{s_{t+1}^j = \phi_{t+1}^j(s_t^j, \{y_{t+1}^j, z_{t+1}, a_t^j\}; g^\sigma)\}\Big)\, \mathbb{P}\{z_{t+1}, y_{t+1} \mid a_t, x_{t+1}\}\, \mathbb{P}\{x_{t+1} \mid x_t, a_t\}\, \frac{\mathbb{P}^{\sigma\psi}\{x_t, s_t \mid c_t\}}{\mathbb{P}^{\sigma\psi}\{s_t \mid c_t\}}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$= \sum_{y_{t+1}, z_{t+1}, x_{t+1}, x_t} \Big[\Big(\prod_j \mathbb{1}\{s_{t+1}^j = \phi_{t+1}^j(s_t^j, \{y_{t+1}^j, z_{t+1}, a_t^j\}; g^\sigma)\}\Big)\, \mathbb{P}\{z_{t+1}, y_{t+1} \mid a_t, x_{t+1}\}\, \mathbb{P}\{x_{t+1} \mid x_t, a_t\}\, \frac{\pi_t(x_t, s_t)}{\sum_{\hat{x}_t} \pi_t(\hat{x}_t, s_t)}\, \mathbb{1}\{\pi_{t+1} = \psi_{t+1}(\pi_t, z_{t+1})\}\Big]$$
$$= \mathbb{P}^{\sigma\psi}\{s_{t+1}, \pi_{t+1} \mid s_t, \pi_t, a_t\}.$$

Proof of Theorem 2. Consider an arbitrary strategy profile $g$.
We prove the existence of a SIB strategy profile that is equivalent to $g$ by construction. With some abuse of notation, let $\sigma^i(\Pi_t, S_t^i, \omega_t)$ denote agent $i$'s strategy using the public randomization device $\omega_t$. We construct a SIB strategy profile $\sigma_t$ that has the following properties:

(a) the induced distribution on $\{\Pi_{t+1}, S_{t+1}\}$ under $\sigma$ coincides with the one under $g$, i.e.
$$\mathbb{P}^{\sigma^t}\{\pi_{t+1}, s_{t+1}\} = \mathbb{P}^{g^t}\{\pi_{t+1}, s_{t+1}\}; \quad (32)$$

(b) the continuation payoff for all the agents under $\sigma$ is the same as that under $g$, i.e. for all $i \in \mathcal{N}$,
$$\mathbb{E}^g\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, g_\tau(H_\tau))\Big\} = \mathbb{E}^\sigma\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, \sigma_\tau(\Pi_\tau, S_\tau, \omega_\tau))\Big\}. \quad (33)$$

We prove condition (a) by forward induction and condition (b) by backward induction. We note that condition (a) is satisfied for $t = 1$, since at $t = 1$ no action has been taken. Moreover, condition (b) is satisfied for $t = T+1$ since there is no future.

Assume that condition (a) is satisfied from $1$ to $t$, $t \in \mathcal{T}$. We construct $\sigma_t$ below such that condition (a) is satisfied at $t+1$.

To construct $\sigma_t$, we first define below a random vector $R_t^{\mathcal{N}}$ based on $H_t^{\mathcal{N}}$, such that for every $i \in \mathcal{N}$, (i) $R_t^{\mathcal{N}}$ is independent of $\Pi_t$ and $S_t^{\mathcal{N}}$, and (ii) $H_t^i$ can be reconstructed from $R_t^i$ along with $\Pi_t$ and $S_t^i$.

We proceed as follows. For every time $t \in \mathcal{T}$, let $(\pi_t, s_t^{\mathcal{N}})$ denote the realization of the agents' sufficient common information and private information, respectively. Let $\mathcal{H}_t^i := \{h_t^{i,1}, \ldots, h_t^{i,|\mathcal{H}_t^i|}\}$ denote the set of all histories of agent $i$ at time $t$, where $|\mathcal{H}_t^i|$ denotes the number of possible realizations of agent $i$'s history at time $t$. Conditioned on the realization of $(\pi_t, s_t^i)$, let $\{p(h_t^{i,k} \mid \pi_t, s_t^i), 1 \le k \le |\mathcal{H}_t^i|\}$ denote the probability mass function on $\mathcal{H}_t^i$ that leads to $(\pi_t, s_t^i)$ for agent $i$.
Define the random variable $R_t^i$ on $[0,1]$ as follows:

1) $$\mathbb{P}\big\{0 \le R_t^i \le p(h_t^{i,1} \mid \pi_t, s_t^i)\big\} = p(h_t^{i,1} \mid \pi_t, s_t^i), \quad (34)$$
and conditioned on the event $\{0 \le R_t^i \le p(h_t^{i,1} \mid \pi_t, s_t^i)\}$, $R_t^i$ is uniformly distributed on $[0, p(h_t^{i,1} \mid \pi_t, s_t^i)]$.

2) For $1 < k \le |\mathcal{H}_t^i|$,
$$\mathbb{P}\Big\{\sum_{j=1}^{k-1} p(h_t^{i,j} \mid \pi_t, s_t^i) \le R_t^i \le \sum_{j=1}^{k} p(h_t^{i,j} \mid \pi_t, s_t^i)\Big\} = p(h_t^{i,k} \mid \pi_t, s_t^i), \quad (35)$$
and conditioned on the event $\{\sum_{j=1}^{k-1} p(h_t^{i,j} \mid \pi_t, s_t^i) \le R_t^i \le \sum_{j=1}^{k} p(h_t^{i,j} \mid \pi_t, s_t^i)\}$, $R_t^i$ is uniformly distributed on $\big[\sum_{j=1}^{k-1} p(h_t^{i,j} \mid \pi_t, s_t^i), \sum_{j=1}^{k} p(h_t^{i,j} \mid \pi_t, s_t^i)\big]$.

Therefore, $R_t^i$ is uniformly distributed on $[0,1]$ and is independent of $(\Pi_t, S_t^i)$. Furthermore, for any realization $(\pi_t, s_t^i, r_t^i)$ we can uniquely determine $h_t^{i,l}$ where
$$l := \max\Big\{k : r_t^i \ge \sum_{j=1}^{k-1} p(h_t^{i,j} \mid \pi_t, s_t^i)\Big\}. \quad (36)$$
Therefore, the random variable $R_t^i$ defined above satisfies the above-mentioned conditions (i) and (ii) when $\mathcal{H}_t^i$ is finite. We show below that $R_t^i$ is independent of $S_t$.

Lemma 2. The random variable $R_t^i$, $i \in \mathcal{N}$, is independent of $\Pi_t$ and $S_t$ for all $t \in \mathcal{T}$.

Proof of Lemma 2. Consider an arbitrary realization $(h_t^1, \ldots, h_t^N)$ of $(H_t^1, \ldots, H_t^N)$.
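The construction (34)-(36) is an inverse-CDF (quantile) coupling: conditioned on $(\pi_t, s_t^i)$, each history $h_t^{i,k}$ is encoded as a uniform draw from the $k$-th sub-interval of $[0,1]$, and (36) decodes the interval index back from $r_t^i$. A sketch with a hypothetical four-point conditional pmf standing in for $p(\cdot \mid \pi_t, s_t^i)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conditional pmf p(h | pi_t, s_t^i) over 4 possible histories.
p = np.array([0.1, 0.4, 0.3, 0.2])
cdf = np.concatenate(([0.0], np.cumsum(p)))   # interval endpoints in [0,1]

def draw_r_given_history(k):
    """Sample R_t^i conditioned on H_t^i = h^{i,k}: uniform on the k-th
    CDF sub-interval, per (34)-(35)."""
    return rng.uniform(cdf[k], cdf[k + 1])

def history_from_r(r):
    """Recover the history index from (pi_t, s_t^i, r), per (36):
    the sub-interval of the conditional CDF that contains r."""
    return int(np.searchsorted(cdf, r, side='right')) - 1

# Round trip: encode each history into R and decode it back.
ok = all(history_from_r(draw_r_given_history(k)) == k for k in range(4))
```

Because each history is mapped to a sub-interval of length equal to its conditional probability, the marginal of $R_t^i$ is uniform on $[0,1]$ regardless of $(\pi_t, s_t^i)$, which is exactly the independence property (i).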
Let $((s_t^1, \pi_t, r_t^1), \ldots, (s_t^N, \pi_t, r_t^N))$ denote the realization of $((S_t^1, \Pi_t, R_t^1), \ldots, (S_t^N, \Pi_t, R_t^N))$, where $(s_t^i, \pi_t, r_t^i)$ corresponds to $h_t^i$ as defined above for every $i \in \mathcal{N}$.

For every $i \in \mathcal{N}$ we have,
$$\mathbb{P}^g\{r_t^i \mid \pi_t, s_t\} = \mathbb{P}^g\{r_t^i \mid \pi_t, s_t^i, s_t^{-i}\} = \frac{\mathbb{P}^g\{r_t^i, s_t^{-i} \mid \pi_t, s_t^i\}}{\mathbb{P}^g\{s_t^{-i} \mid \pi_t, s_t^i\}} = \frac{\mathbb{P}^g\{s_t^{-i} \mid r_t^i, \pi_t, s_t^i\}\, \mathbb{P}^g\{r_t^i \mid \pi_t, s_t^i\}}{\mathbb{P}^g\{s_t^{-i} \mid \pi_t, s_t^i\}} = \frac{\mathbb{P}^g\{s_t^{-i} \mid h_t^i\}\, \mathbb{P}^g\{r_t^i \mid \pi_t, s_t^i\}}{\mathbb{P}^g\{s_t^{-i} \mid \pi_t, s_t^i\}}. \quad (37)$$
The last equality holds because $H_t^i$ is uniquely determined by $(\Pi_t, S_t^i, R_t^i)$ and vice versa; see (34)-(36). Moreover,
$$\mathbb{P}^g\{s_t^{-i} \mid h_t^i\} \overset{\text{by (8)}}{=} \mathbb{P}^g\{s_t^{-i} \mid s_t^i, c_t\} = \frac{\mathbb{P}^g\{s_t^{-i}, s_t^i \mid c_t\}}{\mathbb{P}^g\{s_t^i \mid c_t\}} = \frac{\pi_t^g(s_t^{-i}, s_t^i)}{\sum_{\hat{s}_t^{-i}} \pi_t^g(\hat{s}_t^{-i}, s_t^i)} = \mathbb{P}\{s_t^{-i} \mid s_t^i, \pi_t^g\}. \quad (38)$$
Combining (37) and (38),
$$\mathbb{P}^g\{r_t^i \mid \pi_t, s_t\} = \frac{\mathbb{P}\{s_t^{-i} \mid s_t^i, \pi_t^g\}\, \mathbb{P}^g\{r_t^i \mid \pi_t, s_t^i\}}{\mathbb{P}^g\{s_t^{-i} \mid \pi_t, s_t^i\}} = \mathbb{P}^g\{r_t^i \mid \pi_t, s_t^i\} = \mathbb{P}^g\{r_t^i\}, \quad (39)$$
where the last equality is true since by definition $R_t^i$ is independent of $(\Pi_t, S_t^i)$. Therefore, by (39), $R_t^i$ is independent of $\Pi_t$ and $S_t$ for all $i \in \mathcal{N}$.

Using the result of Lemma 2, we have established that for every $i \in \mathcal{N}$, (i) $R_t^{\mathcal{N}}$ is independent of $\Pi_t$ and $S_t^{\mathcal{N}}$, and (ii) $H_t^i$ can be reconstructed from $R_t^i$ along with $\Pi_t$ and $S_t^i$.

In the following, we construct a SIB strategy profile $\sigma_t$ equivalent to $g_t$. Let $\hat{R}_t^{\mathcal{N}}(\omega_t)$ denote a random vector that the agents construct using the public randomization device $\omega_t$ and that has a joint cumulative distribution identical to that of $R_t^{\mathcal{N}}$. Note that by Lemma 2, the distribution of $R_t^{\mathcal{N}}$ is independent of $S_t$ and $\Pi_t$. Define
$$\sigma_t^i(\Pi_t, S_t^i, \omega_t) := g_t^i\big(F_{R_t^i \mid S_t^i, \Pi_t}^{-1}(\hat{R}_t^i(\omega_t), \Pi_t, S_t^i)\big). \quad (40)$$
Then,
$$\mathbb{P}^{g^t}\{\pi_{t+1}, s_{t+1} \mid H_t\} = \mathbb{P}^{g^t}\{\pi_{t+1}, s_{t+1} \mid \Pi_t, S_t, R_t\} \overset{\text{distribution}}{=} \mathbb{P}^{g^t}\{\pi_{t+1}, s_{t+1} \mid \Pi_t, S_t, \hat{R}_t\} = \mathbb{P}^{\sigma^t}\{\pi_{t+1}, s_{t+1} \mid \Pi_t, S_t, \hat{R}_t\}.$$
Taking the expectation of the left- and right-hand sides with respect to $\omega_t$ and $R_t$, respectively, and using the fact that $\hat{R}_t(\omega_t)$ and $R_t$ are independent of $S_t$ and $\Pi_t$ (Lemma 2), we obtain
$$\mathbb{P}^{\sigma^t}\{\pi_{t+1}, s_{t+1} \mid \Pi_t, S_t\} = \mathbb{P}^{g^t}\{\pi_{t+1}, s_{t+1} \mid \Pi_t, S_t\} \quad \text{w.p. } 1. \quad (41)$$
By the induction hypothesis, we have $\mathbb{P}^{\sigma^{t-1}}\{\pi_t, s_t\} = \mathbb{P}^{g^{t-1}}\{\pi_t, s_t\}$. Therefore, taking the expectation of both sides of (41) with respect to $(\Pi_t, S_t)$, we establish that condition (a) holds for time $t+1$.

Next, we prove condition (b) by backward induction. We have,
$$\mathbb{E}^g\{u_t^i(X_t, A_t) \mid H_t\} = \mathbb{E}^g\{u_t^i(X_t, A_t) \mid \Pi_t, S_t, R_t\} \overset{\text{distribution}}{=} \mathbb{E}^g\{u_t^i(X_t, A_t) \mid \Pi_t, S_t, \hat{R}_t\} = \mathbb{E}^\sigma\{u_t^i(X_t, A_t) \mid \Pi_t, S_t, \hat{R}_t\}. \quad (42)$$
Using (42) for $t = T$, condition (b) is satisfied for $t = T$.

Now assume that condition (b) is satisfied from $t+1$ to $T$, $t \in \mathcal{T}$. We prove that condition (b) is satisfied at $t$. Using condition (a) at time $t$, i.e. $\mathbb{P}^{\sigma^{t-1}}\{s_t, \pi_t\} = \mathbb{P}^{g^{t-1}}\{s_t, \pi_t\}$, the induction hypothesis on condition (b) for $t+1$ along with equation (42) for $t$, and the fact that $R_t$ and $\hat{R}_t$ are identically distributed and independent of $\Pi_t$ and $S_t$, we obtain
$$\mathbb{E}^g\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, g_\tau(H_\tau))\Big\} = \mathbb{E}^\sigma\Big\{\sum_{\tau=t}^{T} u_\tau^i(X_\tau, \sigma_\tau(\Pi_\tau, S_\tau, \omega_\tau))\Big\}.$$

Proof of Theorem 3. By the result of Theorem 2, we can restrict attention to SIB strategies with a public randomization device without loss of generality. Moreover, since by Assumption 1 all spaces are finite, we can restrict attention to SIB strategies (with no public randomization device) without loss of generality.
The proof of Theorem 3 then follows from an argument identical to the one given for dynamic programming for POMDPs (see [3, Ch. 6.7]).

The dynamic program described by (15)-(17) can be viewed as a solution to the following decision problem that is equivalent to the original dynamic team problem. Consider a "super agent" that knows the functional forms of the system dynamics and the agents' utilities, and the spaces $\mathcal{X}_t$, $\mathcal{A}_t^{\mathcal{N}}$, $\mathcal{S}_t^{\mathcal{N}}$ for all $t$. The super agent coordinates the agents' decisions at each time as follows. The super agent observes $\pi_t$ (which is common knowledge among all agents) but does not know the realizations $s_t^{\mathcal{N}}$ of the agents' sufficient private information. Based on his information, the super agent chooses a joint set of prescriptions (partial functions) $\sigma_t^{\mathcal{N}}(\pi_t, \cdot)$, one for each agent, that determine agent $i$'s action for every realization $s_t^i$ as $\sigma_t^i(\pi_t, s_t^i)$, for all $t$ and $i$. The dynamic program described by (15)-(17) determines an optimal solution for the above-described super agent, and thus, equivalently, determines an optimal strategy for the original dynamic team problem. (This interpretation of the dynamic program from the point of view of a super agent is similar to the coordinator problem formulated in [18], [26].)

Proof of Theorem 4. We show below that $L_t^i := (S_t^i, \Pi_t)$, $i \in \mathcal{N}$, $t \in \mathcal{T}$, satisfies conditions (i)-(iv) of Definition 3.

Condition (i) is satisfied since both $S_t^{\mathcal{N}}$ and $\Pi_t$ can be updated recursively via the update rules $\phi_t^{\mathcal{N}}$ and $\psi_t$, respectively, for every $t \in \mathcal{T}$.

Condition (ii) is satisfied by Lemma 1.

To prove condition (iii), we have
$$\mathbb{P}\{x_t \mid c_t, s_t\} = \frac{\mathbb{P}\{x_t, s_t \mid c_t\}}{\sum_{\hat{x}_t} \mathbb{P}\{\hat{x}_t, s_t \mid c_t\}} = \mathbb{P}\{x_t \mid \pi_t, s_t\}. \quad (43)$$
Therefore,
$$\mathbb{E}^{\tilde{g}^{-i,t-1}}\{u_t^i(X_t, A_t) \mid c_t, p_t^i, a_t\} \overset{\text{by (7)}}{=} \mathbb{E}^{\tilde{g}^{-i,t-1}}\{u_t^i(X_t, A_t) \mid c_t, s_t^i, a_t\} = \mathbb{E}^{\tilde{g}^{-i,t-1}}\Big\{\mathbb{E}^{\tilde{g}^{-i,t-1}}\{u_t^i(X_t, A_t) \mid X_t, a_t\} \,\Big|\, c_t, s_t^i, a_t\Big\}$$
$$\overset{\text{by (43)}}{=} \mathbb{E}^{\tilde{g}^{-i,t-1}}\Big\{\mathbb{E}^{\tilde{g}^{-i,t-1}}\{u_t^i(X_t, A_t) \mid X_t, a_t\} \,\Big|\, \pi_t, s_t^i, a_t\Big\} = \mathbb{E}^{\tilde{g}^{-i,t-1}}\{u_t^i(X_t, A_t) \mid \pi_t, p_t^i, a_t\}.$$

Condition (iv) holds since
$$\mathbb{P}^{\tilde{g}^{-i,t-1}, \tilde{g}^{i,t-1}}\{l_t^{-i} \mid p_t^i, c_t\} = \mathbb{P}^{\tilde{g}^{-i,t-1}}\{s_t^{-i} \mid p_t^i, c_t\} \overset{\text{by (7)}}{=} \mathbb{P}^{\tilde{g}^{-i,t-1}}\{s_t^{-i} \mid s_t^i, c_t\} = \frac{\mathbb{P}^{\tilde{g}^{-i,t-1}}\{s_t \mid c_t\}}{\mathbb{P}^{\tilde{g}^{-i,t-1}, \tilde{g}^{i,t-1}}\{s_t^i \mid c_t\}} \overset{\text{by (43)}}{=} \mathbb{P}^{\tilde{g}^{-i,t-1}}\{s_t^{-i} \mid s_t^i, \pi_t\} = \mathbb{P}^{\tilde{g}^{-i,t-1}}\{l_t^{-i} \mid l_t^i\}.$$

Proof of Theorem 5. Consider the SIB strategy $\sigma_t$ constructed in the proof of Theorem 2 for every $t \in \mathbb{N}$. We show below that $\sigma_t$ satisfies (26). By the proof of Theorem 2, condition (32) holds for all $t \in \mathbb{N}$. To prove (26), we show that under the strategy $\sigma_t$, $t \in \mathbb{N}$, we have
$$\Big|\mathbb{E}^g\Big\{\sum_{\tau=t}^{\infty} \delta^{\tau-1} u_\infty^i(g_\tau^{\mathcal{N}}(H_\tau^{\mathcal{N}}), X_\tau)\Big\} - \mathbb{E}^{\sigma_\infty}\Big\{\sum_{\tau=t}^{\infty} \delta^{\tau-1} u_\infty^i(\sigma_\infty^{\mathcal{N}}(\Pi_\tau, S_\tau^{\mathcal{N}}), X_\tau)\Big\}\Big| \le \epsilon, \quad (44)$$
for all $\epsilon > 0$.

Let $M = \max_{a_t, x_t, i} |u_\infty^i(x_t, a_t)|$. For every $\epsilon > 0$, choose $T \in \mathbb{N}$ such that $\frac{\delta^{T-1}}{1-\delta} M \le \epsilon$.
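The truncation step above can be made concrete: the tail $\sum_{\tau \ge T} \delta^{\tau-1} M$ of the discounted sum equals $\delta^{T-1} M / (1-\delta)$, so it suffices to take the smallest $T$ for which this geometric tail is at most $\epsilon$. A small sketch with hypothetical values of $\delta$, $M$, and $\epsilon$:

```python
def truncation_horizon(delta, M, eps):
    """Smallest T with delta**(T-1) * M / (1 - delta) <= eps, so that
    the discounted tail sum_{tau >= T} delta**(tau-1) * M is <= eps."""
    T = 1
    while delta ** (T - 1) * M / (1 - delta) > eps:
        T += 1
    return T

# Hypothetical numbers: discount 0.9, utility bound 10, tolerance 0.01.
T = truncation_horizon(delta=0.9, M=10.0, eps=0.01)  # -> 89
```

This is why $\delta < 1$ and boundedness of $u_\infty^i$ are needed: they make the tail of every strategy's payoff uniformly small, reducing the infinite-horizon comparison (44) to the finite-horizon equivalence already established in Theorem 2.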
Then, for any arbitrary strategy $\tilde{g}$,
$$\Big|\mathbb{E}^{\tilde{g}}\Big\{\sum_{\tau=T}^{\infty} \delta^{\tau-1} u_\infty^i(\tilde{g}_\tau^{\mathcal{N}}(H_\tau^{\mathcal{N}}), X_\tau)\Big\}\Big| \le \epsilon. \quad (45)$$
Therefore, for every $t < T$, condition (44) is satisfied by (45) and the triangle inequality.