Dynamic Games among Teams with Delayed Intra-Team Information Sharing
Dengwang Tang, Hamidreza Tavafoghi, Vijay Subramanian, Ashutosh Nayyar, Demosthenis Teneketzis
Abstract—We analyze a class of stochastic dynamic games among teams with asymmetric information, where members of a team share their observations internally with a delay of d. Each team is associated with a controlled Markov chain, whose dynamics are coupled through the players' actions. These games exhibit challenges in both theory and practice due to the presence of signaling and the increasing domain of information over time. We develop a general approach to characterize a subset of Nash equilibria where the agents can use a compressed version of their information, instead of the full information, to choose their actions. We identify two subclasses of strategies: Sufficient Private Information Based (SPIB) strategies, which only compress private information, and Compressed Information Based (CIB) strategies, which compress both common and private information. We show that while SPIB-strategy-based equilibria always exist, the same is not true for CIB-strategy-based equilibria. We develop a backward inductive sequential procedure, whose solution (if it exists) provides a CIB-strategy-based equilibrium. We identify some instances where we can guarantee the existence of a solution to the above procedure. Our results highlight the tension among compression of information, existence of (compression-based) equilibria, and backward inductive sequential computation of such equilibria in stochastic dynamic games with asymmetric information.

I. INTRODUCTION

Dynamic games with asymmetric information appear in many socioeconomic contexts. In these games, multiple agents/decision makers interact repeatedly in a changing environment. Agents have different information and seek to optimize their respective long-term payoffs. For example, multiple companies may compete with each other in a market over time, and each company attempts to optimize its own long-term benefits [1]–[5]; the market is also changing over time, driven by the actions the companies take.
Another instance of such games arises in cyberphysical systems [6]–[10]: at each time, attackers make decisions on which hosts to attack, and the system administrators/defenders choose actions to defend against the attackers, for example by isolating some hosts from the rest of the system [6]; the system's state changes over time as a result of the attackers' and defenders' actions.
This work is supported by NSF Grants No. ECCS 1750041, ECCS 2038416, ECCS 1608361, CCF 2008130, ARO Award No. W911NF-17-1-0232, and MIDAS Sponsorship Funds by General Dynamics. D. Tang, V. Subramanian, and D. Teneketzis are with Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI 48109, USA. E-mail: [email protected], [email protected], [email protected]. H. Tavafoghi is with Mechanical Engineering, University of California, Berkeley, CA 94720, USA. E-mail: [email protected]. A. Nayyar is with the Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].

In all instances of these games, when an agent takes an action, she needs to consider not only how the action will affect her current payoff but also how it will influence the system's evolution and the future actions of all agents, and hence her future payoffs.

In some settings, agents can form groups, or teams [11], [12]. The agents in the same group share a common goal but may have different information available to them. This information asymmetry among teammates appears in many engineering applications. In most of these applications, the state of the system changes fast, and agents have to make real-time decisions. Moreover, communication between agents is either costly or restricted by bandwidth or delay. Examples of our settings include competing fleets of automated cars from rival companies [13] and the DARPA Spectrum Challenge [14]. In the DARPA Spectrum Challenge setup, individual transceivers work in teams to maximize the sum throughput of their networks. Teams compete with other teams, and members of the same team need to coordinate and evolve their responses over time. In these settings, agents in the same team aim to choose their strategy jointly to achieve team optimality (i.e.
to choose the joint strategy profile that maximizes the expected utility of the team over all joint strategy profiles) rather than just person-by-person optimality (a team strategy is person-by-person optimal, PBPO, when each team member's strategy is an optimal response to the other team members' strategy profile). We study a stylized model of such settings in this paper.

It is worth stating that the games among teams we focus on in this paper are different from cooperative games in economics research (e.g. see [15], Chapters 8-10). In cooperative game theory, the goal is to study the group formation process among agents with different objectives. In our setting, groups are assumed to be fixed and given, and we focus instead on determining the optimal actions and payoffs for each group. A unilateral deviation in our problems means one or more agents in one group deviates, but the community structure of the agents stays the same.

There are three main challenges that need to be addressed when studying dynamic games among individual players: (i) the agents' decisions and information are interdependent over time; in particular, signaling is present in these games, i.e. agents actively infer other agents' private information based on their actions and their strategies; (ii) the domain of the agents' strategies grows over time; (iii) signaling in games is more challenging and subtle than in team problems due to the diverging incentives of the agents. Games among teams inherit all the above challenges. Moreover, we have the additional challenge of coordination among asymmetrically informed team members to achieve team optimality instead of person-by-person optimality.

In this paper we propose a general approach to characterize a subset of equilibrium strategies of dynamic games among teams with the following goals: (i) to determine an appropriate compression of information for each agent to base their decisions on; (ii) to develop a sequential decomposition of the game.
In addition, we would like to determine conditions sufficient to guarantee the existence of such equilibrium strategies.
A. Related Literature
To understand games among teams, we first examine a team's best-response strategy when other teams' strategies are fixed. Team problems, or decentralized control problems, have been extensively studied in the control literature. Researchers have developed various methodologies/approaches to decentralized control problems to determine team-optimal strategies or PBPO strategies, and to determine structural results/properties for the above-mentioned strategies. These methodologies include: (i) the person-by-person approach [16]–[30]; (ii) the designer's approach [31], [32]; (iii) the coordinator's approach [33]–[36]. The person-by-person approach has been used to determine qualitative/structural properties of team-optimal or PBPO strategies. In this approach, the strategies of all team members/agents except one, say agent i, are assumed to be arbitrary but fixed; then the qualitative properties of agent i's best-response strategy are determined. These properties are then valid for all possible (fixed) strategies of the other agents. The designer's approach investigates the decentralized control/team problem from the point of view of a designer who knows the system model and the joint probability distributions of the primitive random variables (the system's initial state, the noise driving the system, and the noise in the agents' observations). The designer chooses the strategies of all team members at time 0 by solving an open-loop stochastic control problem, where her decision at each time is the strategy/control law for all the team members/agents. Applying stochastic control results, the designer can obtain a dynamic programming decomposition. The methodology developed in this paper is inspired by the coordinator's approach used in [33], [34], [36]. Similar to the designer's approach, the coordinator's approach assumes that a fictitious agent, called the coordinator, assigns instructions to agents.
However, unlike the designer's approach, the coordinator is assumed to know the common information of all agents, and assigns partial strategies (prescriptions) instead of full strategies to agents. The partial strategies tell an agent how to utilize her private information to generate actions. Both the designer's approach and the coordinator's approach lead to the determination of globally optimal team strategy profiles.

Research on dynamic games roughly consists of two directions. One direction focuses on repeated games or multi-stage games, where the instantaneous payoffs at each stage are only affected by actions in this stage but not by the actions in the previous stages. In these games, researchers investigated long-term interactions among agents (e.g. punishment and reward strategies) and characterized the set of equilibrium payoffs (e.g. see [37] or [15], Chapter 7). The other direction focuses on games with an underlying dynamic system, in other words, games where instantaneous payoffs can be affected by previous actions. In this more complicated setting, researchers attempted to develop methodologies for the determination of equilibria with either a general structure or a specialized structure. In this paper we focus on the latter direction.

Games of individual agents (i.e. agents do not form teams) with an underlying dynamic system have been studied in both the economics and the control literature. Dynamic games with symmetric information have been studied extensively [38], [39]. In [48], the authors propose the concept of Markov Perfect Equilibrium (MPE) for the case where the state of the system and the agents' actions are perfectly observable. The research on dynamic games with asymmetric information can be classified into two categories: zero-sum games and general (i.e. not necessarily zero-sum) games. Zero-sum games are analyzed in [40]–[47].
In these works, the authors take advantage of many properties of zero-sum games, such as having a unique value and the interchangeability of equilibrium strategies. These properties do not extend to general non-zero-sum games. The literature on general dynamic games includes [49]–[58]. In [54], the authors extend the MPE concept of [48] to the case where the underlying dynamics are only partially observable. Under the crucial assumption that the common information based (CIB) belief is strategy-independent, the authors prove that there exist equilibria where agents play CIB strategies, i.e. the agents choose their actions based on the CIB belief and private information instead of the full information. Furthermore, such equilibria can be found through a sequential decomposition of the game. In our setup the system state is not perfectly observed, thus our model is distinctly different from that of [48]. Furthermore, in contrast to [54], the CIB belief in our model is strategy-dependent.

The closest work to our paper in terms of both model and approach is [55]. In [55], the authors consider a game model where, in contrast to [54], the CIB beliefs are strategy-dependent. They propose the concept of Common Information Based Perfect Bayesian Equilibrium (CIB-PBE) as a solution concept for this game model and prove that a CIB-PBE can be found through a sequential decomposition whenever this decomposition has a solution. The game model of [55] has multiple features that prevent us from directly applying their results in our analysis in Section V. We will make a more detailed comparison in Section III. Our work is also close in spirit to [49]. In [49], the authors extend their work in [48] by considering games where actions are observable but each agent has a fixed, private utility type. They propose Markov Sequential Equilibrium (MSE) as a solution concept for these games, where the agents choose their actions based on a compression of their information along with their beliefs on the types of other agents.
The authors show by example that MSE do not necessarily exist. As an alternative to MSE, they propose a new concept obtained from limits of ε-MSE as ε goes to 0.

Unlike either team problems or dynamic games among individual agents, games among teams (in particular, ones with an underlying dynamic system) have not been systematically studied in the literature. There are only a few works on special models of games among teams. In [59] and [60], the authors propose algorithms to compute equilibria for zero-sum multiplayer extensive-form games, where a team of players plays against an adversary. In [61] the authors provide an example of a zero-sum game which involves a team. However, the players in this team have symmetric information, hence the team is equivalent to an individual player with vector-valued actions. In [50] the authors briefly extend their results in [54] to games among teams for a specialized model where the CIB belief is strategy-independent. In both [11] and [12] the authors solve a two-team zero-sum linear quadratic stochastic dynamic game. In [62] the authors formulate and solve a game between two teams of mobile agents. The model and information structure of [62] are different from ours. Additionally, games among teams have been the subject of empirical research (see, for example, [63], [64]). In our work, we study analytically a model of non-zero-sum dynamic stochastic games among teams where the CIB belief is strategy-dependent.

B. Contribution
In this paper, we consider a model of dynamic games among teams with asymmetric information. We assume that each team is associated with a dynamical system that has Markovian dynamics driven by the actions of all agents of all teams. The state of each dynamical system is assumed to be vector-valued, where each component represents an agent's local state. Agents can observe their own local states perfectly and communicate them within their respective teams with a delay of d. All actions are public, i.e., observable by every agent in every team. We also assume the presence of public noisy observations of the system's state. The instantaneous reward of a team depends on the states and actions of all teams. Our model is a generalization of the model in [55] to competing teams.

Our contributions are as follows:
• We identify an appropriate compression of information for each agent. The compression is achieved in two steps: (i) the compression of team-private information, which depends only on the team strategy; (ii) the compression of common information, which depends on the strategies of all agents. The compression steps induce two special classes of strategies: (i) Sufficient Private Information Based (SPIB) strategies, where agents apply only the first step of compression; (ii) Compressed Information Based (CIB) strategies, where agents apply both steps of compression.
• We develop a sequential decomposition of the game where agents play CIB strategies. We show that any solution of the sequential decomposition forms a Nash equilibrium of the game.
• We show that SPIB-strategy-based Nash equilibria always exist, while CIB-strategy-based Nash equilibria do not always exist.
We identify some simple instances where CIB-strategy-based equilibria are guaranteed to exist.

In a broader context, our results highlight the conflicts between compression of information, sequential decomposition, and existence of equilibria that occur in a wide range of dynamic games with asymmetric information, reiterating the message in [49]: in general, compression can hurt the ability to sustain equilibria, since the full history can allow for a finer calibration of the agents' strategies.
C. Organization
We organize the rest of the paper as follows. In Section II we formally present our model and problem. In Section III we transform the game among teams into an equivalent game among coordinators, where each coordinator represents a team. In Section IV we introduce our first step of compression of information and SPIB strategies, and we show the existence of SPIB-strategy-based equilibria. In Section V we introduce the second step of compression and CIB strategies, and we provide a sequential decomposition of the game. We also show the general non-existence of CIB-strategy-based equilibria and provide some conditions for existence. We present some extensions and special cases of our results in Section VI. Then we discuss our results in Section VII. We conclude in Section VIII. Proof details are provided in the Appendix.
D. Notation
We use capital letters to represent random variables, bold capital letters to denote random vectors, and lower case letters to represent realizations. We use superscripts to indicate teams and agents, and subscripts to indicate time. We use i to represent a typical team, and -i to represent all teams other than i. We use t_1:t_2 to indicate the collection of time stamps (t_1, t_1+1, ..., t_2). For example, X_{1:4} stands for the random vector (X_1, X_2, X_3, X_4). For random variables or random vectors, we use the corresponding script capital letters (italic capital letters for Greek letters) to denote the space of values these random vectors can take. For example, H^i_t denotes the space of values the random vector H^i_t can take. The products of sets in this paper are Cartesian products. We use P(·) and E[·] to denote probabilities and expectations, respectively. We use ∆(Ω) to denote the set of probability distributions on a finite set Ω. When writing probabilities, we will omit the random variables when the lower case letters that represent the realizations clearly indicate the random variables they represent. For example, we will use P(y^i_t | x_t, u_t) as a shorthand for P(Y^i_t = y^i_t | X_t = x_t, U_t = u_t). When λ is a function from Ω to ∆(Ω'), with some abuse of notation we write λ(ω' | ω) := (λ(ω))(ω') as if λ were a conditional distribution. We use 1_A to denote the indicator random variable of an event A.

In general, probability distributions of random variables in a dynamic system are only well defined after a complete strategy profile is specified. We specify the strategy profile that defines the distribution in superscripts, e.g. P^g(x^i_t | h_t). When the conditional probability is independent of a certain part of the strategy profile, we may omit this part of the strategy in the notation, e.g. P^{g_{1:t-1}}(x_t | y_{1:t-1}, u_{1:t-1}), P^{g^i}(x^i_t | h_t), or P(x_{t+1} | x_t, u_t).
We say that a realization of some random vector (for example, h_t) is admissible under a partially specified strategy profile (for example, g^{-i}) if the realization has strictly positive probability under some completion of the partially specified strategy profile (in this example, that means P^{g^i, g^{-i}}(h_t) > 0 for some g^i). Whenever we write a conditional probability or conditional expectation, we implicitly assume that the condition has non-zero probability under the specified strategy profile. When only part of the strategy profile is specified in the superscript, we implicitly assume that the condition is admissible under the specified partial strategy profile.

II. PROBLEM FORMULATION

A. System Model and Information Structure
We consider a finite-horizon dynamic game among finitely many teams, each consisting of a finite number of agents, where agents have asymmetric information. Let I = {1, ..., I} denote the set of teams and T = {1, ..., T} denote the set of time indices. We use a tuple (i, j) to indicate the j-th member of team i. For a team i ∈ I, let N^i = {(i, 1), ..., (i, N^i)} denote team i's members. Let N = ∪_{i∈I} N^i denote the set of all agents. At each time t ∈ T, each agent (i, j) selects an action U^{i,j}_t ∈ U^{i,j}_t, where U^{i,j}_t denotes the action space of agent (i, j) at time t. Each team is associated with a vector-valued dynamical system X^i_t = (X^{i,j}_t)_{(i,j)∈N^i} which evolves according to

X^i_{t+1} = f^i_t(X^i_t, U_t, W^{i,X}_t), i ∈ I,

where U_t = (U^{k,j}_t)_{(k,j)∈N}, and (W^{i,X}_t)_{i∈I, t∈T} is the noise in the dynamical system. We assume that X^{i,j}_t ∈ X^{i,j}_t for (i, j) ∈ N.

We assume that the actions of all agents are publicly observed. Further, at time t, after all the agents take actions, a public observation of team i's state is generated according to

Y^i_t = ℓ^i_t(X^i_t, U_t, W^{i,Y}_t), i ∈ I,

where Y^i_t ∈ Y^i_t, and (W^{i,Y}_t)_{i∈I, t∈T} are the observation noises. The order of events occurring between time steps t and t+1 is: the state X^i_t is realized, the actions U^{i,j}_t are taken, the observation Y^i_t is generated, and then the next state X^i_{t+1} is realized.

We assume that the functions (f^i_t)_{i∈I, t∈T} and (ℓ^i_t)_{i∈I, t∈T} are common knowledge among all agents. We further assume that (X^i_1)_{i∈I}, (W^{i,X}_t)_{i∈I, t∈T}, and (W^{i,Y}_t)_{i∈I, t∈T} are mutually independent primitive random variables whose distributions are also common knowledge among all agents.
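One stage of this model can be sketched in code. The following is a minimal toy instance (two teams, scalar binary local states, two agents per team); the particular transition function f and observation function ell below are illustrative stand-ins, not part of the paper's model.

```python
import random

def f(x_i, u_all, w):
    # Team i's next local state depends on its own current state, the
    # actions of ALL teams (this is how the dynamics are coupled), and noise.
    return (x_i + sum(u_all) + w) % 2

def ell(x_i, u_all, w):
    # Noisy public observation of team i's local state.
    return (x_i + w) % 2

def step(states, actions, rng):
    """One stage: states -> actions -> public observations -> next states."""
    u_all = [u for team in actions for u in team]
    obs = [ell(x, u_all, rng.randint(0, 1)) for x in states]
    nxt = [f(x, u_all, rng.randint(0, 1)) for x in states]
    return obs, nxt

rng = random.Random(0)
states = [0, 1]             # one scalar local state per team
actions = [[0, 1], [1, 1]]  # two agents per team
obs, nxt = step(states, actions, rng)
```

Note that each team's next state depends on the other team's actions only through `u_all`, matching the coupling described above.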
As a result, the teams' dynamics (X^i_t)_{t∈T}, i ∈ I, are conditionally independent given the actions, and the public observations of different teams' systems are conditionally independent given the states and actions of all teams.

At each time t, the following information is available to all agents:

H_t = (Y_{1:t-1}, U_{1:t-1}),

where Y_t = (Y^i_t)_{i∈I} and U_t = (U^{i,j}_t)_{(i,j)∈N}. We refer to H_t as the common information among teams.

We assume that each agent (i, j) observes her own state X^{i,j}_t. Further, agents in the same team share their states with each other with a time delay d ≥ 1. Thus, at time t, all agents in team i have access to H^i_t, given by

H^i_t = (Y_{1:t-1}, U_{1:t-1}, X^i_{1:t-d}), i ∈ I.

We call H^i_t the common information within team i. Finally, the information available to agent (i, j) at time t, denoted by H^{i,j}_t, is

H^{i,j}_t = (Y_{1:t-1}, U_{1:t-1}, X^i_{1:t-d}, X^{i,j}_{t-d+1:t}), (i, j) ∈ N.

This model captures the hierarchy of information asymmetry between teams and team members.
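The nesting of the three information sets can be made concrete with plain list slicing. This is a sketch with arbitrary placeholder values; only the index arithmetic reflects the definitions above.

```python
d = 2  # intra-team sharing delay

def common_info(Y, U, t):
    # H_t = (Y_{1:t-1}, U_{1:t-1}): past public observations and actions.
    return (Y[:t-1], U[:t-1])

def team_info(Y, U, X_team, t):
    # H^i_t = (Y_{1:t-1}, U_{1:t-1}, X^i_{1:t-d}): adds the team's local
    # states shared with delay d.
    return common_info(Y, U, t) + (X_team[:t-d],)

def agent_info(Y, U, X_team, X_own, t):
    # H^{i,j}_t = (Y_{1:t-1}, U_{1:t-1}, X^i_{1:t-d}, X^{i,j}_{t-d+1:t}):
    # adds the agent's own not-yet-shared states.
    return team_info(Y, U, X_team, t) + (X_own[t-d:t],)

# Example trajectories for t = 1..5; list index k holds the time-(k+1) value.
Y = ['y1', 'y2', 'y3', 'y4', 'y5']
U = ['u1', 'u2', 'u3', 'u4', 'u5']
X_team = ['x1', 'x2', 'x3', 'x4', 'x5']  # team i's (vector) local states
X_own = ['z1', 'z2', 'z3', 'z4', 'z5']   # agent (i,j)'s own local states

h = agent_info(Y, U, X_team, X_own, t=4)
# h = (Y_{1:3}, U_{1:3}, X^i_{1:2}, X^{i,j}_{3:4})
```

Each level strictly refines the previous one, which is the hierarchy the last sentence above describes.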
Remark 1.
Our model also captures scenarios where a team has only one member. Such a team can be incorporated in our framework by adding a dummy agent to it and assuming a suitable internal communication delay d. If all teams are single-member teams, then d can be chosen arbitrarily.

To illustrate the key ideas of the paper without dealing with technical difficulties arising from continuum spaces, we assume that all the system random variables (i.e. all states, actions, and observations) take values in finite sets.

Assumption 1. X^{i,j}_t, Y^i_t, U^{i,j}_t are finite sets for all (i, j) ∈ N, t ∈ T.

Remark 2.
For convenience of notation and proofs, for t ∈ {-(d-1), ..., -1, 0}, we define X^{i,j}_t = U^{i,j}_t = Y^i_t = {0} and r^i_t(X_t, U_t) = 0 for all i ∈ I and (i, j) ∈ N.

B. Strategies and Reward Functions
For games among teams, there are three possible types of team strategies one could consider: (1) pure strategies, i.e. deterministic strategies; (2) randomized strategies where team members independently randomize; (3) randomized strategies where team members jointly randomize.

A pure strategy profile of a team is a collection of functions µ^i = (µ^{i,j}_t)_{(i,j)∈N^i, t∈T}, where µ^{i,j}_t : H^{i,j}_t → U^{i,j}_t. Define M^{i,j}_t as the space of functions from H^{i,j}_t to U^{i,j}_t. Let M^i = ∏_{t∈T} ∏_{(i,j)∈N^i} M^{i,j}_t. Any randomized strategy of a team, either of type 2 or type 3, can be described through a mixed strategy σ^i ∈ ∆(M^i). In particular, if team members independently randomize, the mixed strategy σ^i being used to describe the strategy profile will be a product of measures on M^{i,j} = ∏_{t∈T} M^{i,j}_t for (i, j) ∈ N^i.

Team i's total reward under a pure strategy profile µ = (µ^{i,j}_t)_{(i,j)∈N, t∈T} is

J^i(µ) = E^µ[ Σ_{t∈T} r^i_t(X_t, U_t) ],

where the functions (r^i_t)_{i∈I, t∈T}, r^i_t : X_t × U_t → R, representing the instantaneous rewards, are common knowledge among all agents. Team i's total reward under a mixed strategy profile σ = (σ^i)_{i∈I}, σ^i ∈ ∆(M^i), is then an average of the total rewards under pure strategy profiles, i.e.

J^i(σ) = Σ_{µ∈M} ( ∏_{i∈I} σ^i(µ^i) ) J^i(µ).

Note that while members of the same team may jointly randomize their strategies, different teams' strategy profiles are always assumed to be independent.

C. Solution Concept
In this work, a team refers to a group of agents that have asymmetric information and the same objective. Because of the shared objective, members of the same team can jointly decide on the strategy to use before the start of the game for the collective benefit of the team. Hence, we can assume that every member of the team knows the strategies of the others in the team. Therefore, when considering an equilibrium concept, we should consider team deviations rather than individual deviations, i.e. multiple members of the same team may decide to play a different strategy than the equilibrium strategy. We consider randomized strategies where team members jointly randomize. Example 1 of Section II-C1 illustrates why such strategies must be considered when we study games among teams.

The above discussion motivates the definition of a Team Nash Equilibrium.

Definition 1 (Team Nash Equilibrium). A mixed strategy profile σ* = (σ*^i)_{i∈I}, σ*^i ∈ ∆(M^i), is said to form a Team Nash Equilibrium (TNE) if

J^i(σ*^i, σ*^{-i}) ≥ J^i(σ̃^i, σ*^{-i})

for any mixed strategy σ̃^i ∈ ∆(M^i), for all i ∈ I.

To implement an arbitrary mixed strategy, a team can choose their strategy profile jointly before the game starts. For example, the team can jointly choose a random strategy profile out of an ensemble of profiles according to some distribution. Alternatively, they can agree on a protocol to utilize a commonly observed randomness source to randomize their strategies in a correlated manner in real time.

The primary objective of this paper is to characterize a subclass of Team NE and to devise a backward inductive sequential computation procedure to determine these Team NE.
1) A Motivating Example:
The following example illustrates the importance of considering jointly randomized mixed strategies when we study games among teams. Similar to the role mixed strategies play in games among individual players, the space of jointly randomized mixed strategies contains the minimum richness of strategies that ensures an equilibrium exists in games among teams. In particular, if we restrict the teams to use independently randomized strategies, i.e. type 1 and type 2 strategies described in Section II-B, then an equilibrium may not exist. This example is similar in spirit to the examples in [59]–[61], despite the fact that in our example the players in the same team have asymmetric information. (We first focus on Nash Equilibrium since it is the simplest and broadest solution concept for games. We will show in Section VI-B that in some cases, our solutions satisfy some notion of sequential rationality for games among teams.)
Example 1 (Guessing Game). Consider a two-stage game (i.e. T = {1, 2}) of two teams I = {A, B}, each consisting of two players. The set of all agents is given by N = {(A,1), (A,2), (B,1), (B,2)}. Let X^A_t = (X^{A,1}_t, X^{A,2}_t) ∈ {-1, 1}^2, and Team B does not have a state, i.e. X^B_t = ∅. Assume U^{i,j}_t = {-1, 1} for (t = 1, i = A) or (t = 2, i = B), and U^{i,j}_t = ∅ otherwise, i.e. Team A moves at time 1, and Team B moves at time 2. At time 1, X^{A,1}_1 and X^{A,2}_1 are independently uniformly distributed on {-1, 1}. Team A's system is assumed to be static, i.e. X^A_2 = X^A_1.

The rewards of Team A are given by

r^A_1(X_1, U_1) = 1{X^{A,1}_1 U^{A,1}_1 X^{A,2}_1 U^{A,2}_1 = -1},
r^A_2(X_2, U_2) = -1{X^{A,1}_1 = U^{B,1}_2} - 1{X^{A,2}_1 = U^{B,2}_2},

and the rewards of Team B are given by

r^B_1(X_1, U_1) = 0,
r^B_2(X_2, U_2) = 1{X^{A,1}_1 = U^{B,1}_2} + 1{X^{A,2}_1 = U^{B,2}_2}.

Assume that there are no additional common observations other than past actions, i.e. Y_t = ∅. We set the delay d = 2, i.e. agent (A,1) does not know X^{A,2}_t throughout the game, and a similar property holds for agent (A,2). In this game, the task of Team A is to choose actions according to their states at t = 1 in order to earn a positive reward, while not revealing too much information through their actions to Team B. The task of Team B is to guess Team A's state.

It can be verified (see Appendix A for a detailed derivation) that if we restrict both teams to use independently randomized strategies (including deterministic strategies), then there exist no equilibria. However, there does exist an equilibrium where Team A randomizes in a correlated manner, specifically the following strategy profile σ*: at t = 1, Team A plays γ^A = (γ^{A,1}, γ^{A,2}) with probability 1/2, and γ̃^A = (γ̃^{A,1}, γ̃^{A,2}) with probability 1/2, where

γ^{A,1}(x^{A,1}) = x^{A,1},  γ^{A,2}(x^{A,2}) = -x^{A,2},
γ̃^{A,1}(x^{A,1}) = -x^{A,1},  γ̃^{A,2}(x^{A,2}) = x^{A,2},

and at t = 2, the two members of Team B choose independent and uniformly distributed actions on {-1, 1}, independent of their action and observation history.
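The payoffs under σ* can be checked exactly by enumerating all states, coin flips, and guesses. The sketch below is ours, not the paper's; it uses exact rational arithmetic to avoid floating-point issues.

```python
from itertools import product
from fractions import Fraction

half = Fraction(1, 2)
EA1 = Fraction(0)   # E[r^A_1]: Team A's stage-1 reward under sigma*
EB = Fraction(0)    # E[r^B_2]: Team B's guessing reward under sigma*

for x1, x2 in product([-1, 1], repeat=2):      # Team A's states, uniform
    for coin in [0, 1]:                        # correlated coin: gamma vs gamma-tilde
        u1, u2 = (x1, -x2) if coin == 0 else (-x1, x2)
        p = half**2 * half                     # P(x1, x2) * P(coin)
        EA1 += p * (1 if x1 * u1 * x2 * u2 == -1 else 0)
        for g1, g2 in product([-1, 1], repeat=2):  # Team B's uniform guesses
            EB += p * half**2 * ((x1 == g1) + (x2 == g2))

assert EA1 == 1          # the correlated play always earns the stage-1 reward
assert EB == 1           # Team B does no better than blind guessing
assert EA1 - EB == 0     # Team A's expected total reward under sigma*
```

The first assertion holds because u1·u2 = -x1·x2 under both coin outcomes, so the product x1·u1·x2·u2 is always -1; the actions reveal only the product x1·x2, which does not improve Team B's expected guessing reward.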
In σ*, each agent (A, j) chooses a uniform random action irrespective of their states. It is important to have (A,1) and (A,2) choose these actions in a correlated way to ensure that they obtain the full instantaneous reward while not revealing any information.

III. GAME OF COORDINATORS

In this section we present a game among individual players that is equivalent to the game among teams formulated in Section II.

We view the agents of a team as being coordinated by a fictitious coordinator as in [34]: at each time t, team i's coordinator instructs the members of team i on how to use their private information H^{i,j}_t \ H^i_t, based on H^i_t and her past instructions up to time t - 1 (see [34]). Using this vantage point, we can view the games among teams as games among coordinators, where the coordinators' actions are the instructions, or prescriptions, provided to individual agents. Notice that unlike agents' actions, coordinators' actions (prescriptions) cannot be publicly observed. To proceed further we formally define coordinators' actions and strategies, and prove Lemma 1.

Definition 2 (Prescription). Coordinator i's prescription at time t is a collection of functions γ^i_t = (γ^{i,j}_t)_{(i,j)∈N^i}, where γ^{i,j}_t : X^{i,j}_{t-d+1:t} → U^{i,j}_t. Define Γ^{i,j}_t to be the space of functions that map X^{i,j}_{t-d+1:t} to U^{i,j}_t. Define Γ^i_t = ∏_{(i,j)∈N^i} Γ^{i,j}_t.

Definition 3 (Pure Coordination Strategy). Define the augmented team-common information of team i to be H̄^i_t = (H^i_t, Γ^i_{1:t-1}), where Γ^i_{1:t-1} are the past prescriptions assigned by the coordinator of team i. A pure coordination strategy of team i is a collection of mappings ν^i = (ν^i_t)_{t∈T}, where ν^i_t : H̄^i_t → Γ^i_t.

The next lemma establishes the equivalence between pure coordination strategies and pure strategies of a team.

Lemma 1.
For every pure coordination strategy profile ν, there exists a pure strategy profile µ that yields the same payoffs for all teams, and vice versa.

Proof. See Appendix B.

Based on the above lemma, we can immediately conclude that a mixed strategy profile is equivalent to a mixed coordination strategy profile (i.e. a distribution on the space of pure coordination strategy profiles). As a result, Team Nash Equilibria, as defined in Section II-C, are equivalent to Nash Equilibria of coordinators, where the coordinators can use mixed coordination strategies.

Therefore, we can transform the games among teams to games among individual players, where each player is a (team) coordinator whose actions are prescriptions. Following the standard approach in game theory, we now consider behavioral strategies of the individuals in this lifted game, i.e. the coordinators, since unlike mixed strategies, behavioral strategies allow for independent randomizations across time and therefore better facilitate a sequential decomposition of the dynamic game.
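The forward direction of the equivalence in Lemma 1 can be sketched for a toy single-team instance (two agents, binary states, delay d = 1): since prescriptions are a deterministic function of the team-common information, each agent can replay the coordination strategy herself and recover the same actions. The coordination rule `nu` below is an arbitrary illustration, not from the paper.

```python
def nu(team_states_hist, past_prescriptions):
    # Illustrative coordination rule: after k shared joint states, agent 1
    # plays her own state; agent 2 plays its complement when k is odd.
    k = len(team_states_hist)
    g1 = lambda x: x
    g2 = (lambda x: 1 - x) if k % 2 == 1 else (lambda x: x)
    return (g1, g2)

def coordinator_actions(traj):
    """Actions generated by running nu as a coordinator over a trajectory."""
    actions, prescriptions = [], []
    for t, (x1, x2) in enumerate(traj):
        shared = traj[:t]                  # joint states shared with delay 1
        g = nu(shared, tuple(prescriptions))
        prescriptions.append(g)
        actions.append((g[0](x1), g[1](x2)))
    return actions

def agent_actions(traj):
    """Same actions via the per-agent strategies mu of Lemma 1: each agent
    recomputes all past prescriptions from the team-common information,
    then applies her own component to her own current state."""
    actions = []
    for t, (x1, x2) in enumerate(traj):
        prescriptions = []
        for s in range(t + 1):             # replay nu from the start
            prescriptions.append(nu(traj[:s], tuple(prescriptions)))
        g = prescriptions[-1]
        actions.append((g[0](x1), g[1](x2)))
    return actions

traj = [(0, 1), (1, 1), (0, 0)]
assert coordinator_actions(traj) == agent_actions(traj)
```

The point of the construction is that no agent needs to see the prescription delivered to her: it is recomputable from information she already has, which is why the coordinator is only a fictitious device.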
Definition 4 (Behavioral Coordination Strategy). A behavioral coordination strategy of team i is a collection of mappings g^i = (g^i_t)_{t∈T} where g^i_t : H̄^i_t → ∆(Γ^i_t).

Given that the coordinators have perfect recall (that is, at any time t, the coordinator remembers all her observations up to time t and all her "actions", i.e., prescriptions, up to time t − 1), we can conclude from Kuhn's theorem [65] that behavioral coordination strategies are equivalent to mixed coordination strategies in the following sense.

Lemma 2.
For any behavioral coordination strategy profile,there exists a mixed coordination strategy profile with the sameexpected payoffs and vice versa.
Based on this equivalence, we can first define Nash Equilibria for the coordinators' game and then restate our objective from Section II-C.
Definition 5 (Coordinators' Nash Equilibrium). For any behavioral coordination strategy profile g, define

J^i(g) = E^g[ Σ_{t∈T} r^i_t(X_t, U_t) ].

A behavioral coordination strategy profile g∗ = (g^{∗i}_t)_{i∈I,t∈T}, where g^{∗i}_t : H̄^i_t → ∆(Γ^i_t), is said to form a Coordinators' Nash Equilibrium (CNE) if

J^i(g^{∗i}, g^{∗−i}) ≥ J^i(g̃^i, g^{∗−i})

for every behavioral coordination strategy g̃^i = (g̃^i_t)_{t∈T}, g̃^i_t : H̄^i_t → ∆(Γ^i_t), and for every team i ∈ I; i.e., the behavioral strategies of the coordinators form a Bayes-Nash Equilibrium in the game of coordinators.

Given that we have lifted the game among teams to a game among coordinators, we adjust the terminology for the information structure accordingly. From now on, we will refer to the common information among all teams (i.e., H_t) simply as the common information, while the information that members of team i share but that is not known to other teams (i.e., H^i_t \ H_t = (X^i_{1:t−d}, Γ^i_{1:t−1})) will be referred to as the private information of coordinator i. The information that is private to an agent (i.e., X^{i,j}_{t−d+1:t}) will be referred to as hidden information, since none of the coordinators observes it.

Remark 3.
The game among coordinators that we obtain has a few differences from the game model in [55]:
• Actions in [55] are publicly observable. However, as mentioned before, in our game among coordinators, the "actions" (prescriptions) of the coordinators are private information.
• The local state X^i_t in [55] is perfectly observed by player i without delay. However, in our game among coordinators, at time t, a coordinator can only observe her local states up to time t − d.
• The transitions of local states in [55] are conditionally independent given the actions, i.e., P(x_{t+1} | x_t, u_t) = ∏_i P(x^i_{t+1} | x^i_t, u_t). However, in our game among coordinators, the transitions of local states are not independent given the prescriptions.
• The public observation process of local states in [55] is conditionally independent given the actions, i.e., P(y_t | x_t, u_t) = ∏_i P(y^i_t | x^i_t, u_t). However, in our game among coordinators, the public observations of local states are not independent given the prescriptions.
Due to the above differences, we cannot directly apply the results of [55] to the game of coordinators.

A. An Illustrative Example
The following example illustrates how to visualize games among teams from the coordinators' viewpoint.

Example 2.
Consider a variant of the Guessing Game in Example 1 with the same system model and information structure but different action sets and reward functions. In the new game, Team A moves at both t = 1 and t = 2, with U^{A,j}_t = {−1, 1} for t = 1, 2 and j = 1, 2. Team B moves only at time t = 2, as in the original game. The new reward functions are given by

r^A_1(X_1, U_1) = 0,
r^A_2(X_2, U_2) = 1{X^{A,2}_1 = U^{A,1}_2, X^{A,1}_1 = U^{A,2}_2} + 1{X^A_1 ≠ U^B_2},
r^B_1(X_1, U_1) = 0,
r^B_2(X_2, U_2) = 1{X^A_1 = U^B_2}.

In this example, Team A's task is to guess its own state after a round of publicly observable communication while not leaking information to Team B. A Team Nash Equilibrium (σ^{∗A}, σ^{∗B}) of this game is as follows: Team A chooses one of the four pure strategy profiles listed below with equal probability:
• µ^{A,1}_1(x^{A,1}) = −x^{A,1}, µ^{A,2}_1(x^{A,2}) = x^{A,2}; µ^{A,1}_2(u_1, x^{A,1}) = u^{A,2}_1, µ^{A,2}_2(u_1, x^{A,2}) = −u^{A,1}_1;
• µ^{A,1}_1(x^{A,1}) = −x^{A,1}, µ^{A,2}_1(x^{A,2}) = −x^{A,2}; µ^{A,1}_2(u_1, x^{A,1}) = −u^{A,2}_1, µ^{A,2}_2(u_1, x^{A,2}) = −u^{A,1}_1;
• µ^{A,1}_1(x^{A,1}) = x^{A,1}, µ^{A,2}_1(x^{A,2}) = x^{A,2}; µ^{A,1}_2(u_1, x^{A,1}) = u^{A,2}_1, µ^{A,2}_2(u_1, x^{A,2}) = u^{A,1}_1;
• µ^{A,1}_1(x^{A,1}) = x^{A,1}, µ^{A,2}_1(x^{A,2}) = −x^{A,2}; µ^{A,1}_2(u_1, x^{A,1}) = −u^{A,2}_1, µ^{A,2}_2(u_1, x^{A,2}) = u^{A,1}_1;

while Team B chooses U^B_2 uniformly at random, independent of U_1. In words, from Team B's point of view, Team A chooses U^A_1 to be a uniform random vector independent of X^A_1. However, the randomization is done in a coordinated manner: before the game starts, each member of Team A randomly draws one of two cards, where one card says "lie" and the other says "tell the truth." Both players then tell each other which card they have drawn before the game starts. At time t = 1, both players in Team A play the strategy indicated by their cards. At time t = 2, Team A can then perfectly recover X^A_1 from U^A_1 and the knowledge of the strategy used at t = 1. Now we describe Team A's equilibrium strategy through the equivalent coordinator A's behavioral strategy.
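The card-drawing scheme just described can be checked by brute force. The sketch below uses our own encoding (not the paper's notation): states and actions are ±1, and each member's card is +1 for "tell the truth" and −1 for "lie."

```python
import itertools
from collections import Counter

def team_A_play(x, cards):
    """t = 1: each member plays card * state; t = 2: each member guesses the
    other's state by undoing the other member's card."""
    u1 = (cards[0] * x[0], cards[1] * x[1])
    guesses = (cards[1] * u1[1], cards[0] * u1[0])
    return u1, guesses

counts = Counter()
for x in itertools.product((-1, 1), repeat=2):
    for cards in itertools.product((-1, 1), repeat=2):
        u1, guesses = team_A_play(x, cards)
        assert guesses == (x[1], x[0])   # Team A always recovers its state
        counts[(x, u1)] += 1

# Every (state, t=1 action) pair occurs exactly once over the four card
# draws, so U^A_1 looks uniform and independent of X^A_1 to Team B.
assert len(counts) == 16 and all(v == 1 for v in counts.values())
```

The check confirms both equilibrium properties at once: perfect intra-team recovery and zero information leakage to Team B.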
Use ng to denote the prescription that maps −1 to 1 and 1 to −1. Use id to denote the identity-map prescription, i.e., the prescription that maps −1 to −1 and 1 to 1. Use cp_b to denote the constant prescription that always instructs individuals to play b ∈ {−1, 1}. The mixed strategy profile σ^{∗A} is equivalent to the following behavioral coordination strategy: at time t = 1, g^A_1(∅) ∈ ∆(Γ^{A,1}_1 × Γ^{A,2}_1) satisfies

g^A_1(∅)(γ^{A,1}_1, γ^{A,2}_1) = 1/4  ∀ γ^{A,1}_1, γ^{A,2}_1 ∈ {ng, id}.

At time t = 2, g^A_2 : U^{A,1}_1 × U^{A,2}_1 × Γ^{A,1}_1 × Γ^{A,2}_1 → ∆(Γ^{A,1}_2 × Γ^{A,2}_2) is a deterministic strategy that satisfies

g^A_2(u^{A,1}_1, u^{A,2}_1, ng, id) = DM(cp_{u^{A,2}_1}, cp_{−u^{A,1}_1}),
g^A_2(u^{A,1}_1, u^{A,2}_1, ng, ng) = DM(cp_{−u^{A,2}_1}, cp_{−u^{A,1}_1}),
g^A_2(u^{A,1}_1, u^{A,2}_1, id, id) = DM(cp_{u^{A,2}_1}, cp_{u^{A,1}_1}),
g^A_2(u^{A,1}_1, u^{A,2}_1, id, ng) = DM(cp_{−u^{A,2}_1}, cp_{u^{A,1}_1}),

where DM : Γ^{A,1}_2 × Γ^{A,2}_2 → ∆(Γ^{A,1}_2 × Γ^{A,2}_2) maps a prescription profile to the delta measure on it. In words, the coordinator of Team A randomly chooses one of the four possible prescription profiles at time t = 1. At time t = 2, based on the observed actions and the prescriptions chosen before, the coordinator of Team A directly assigns actions to the agents, instructing them to recover the state from the actions at t = 1. Note that the behavioral coordination strategy at t = 2 depends explicitly on the past prescriptions Γ^A_1 in addition to the realization of past actions. This is because the coordinator needs to remember not only the agents' actions, but also the rationale behind those actions, in order to interpret the signals sent through them.

IV. COMPRESSION OF PRIVATE INFORMATION

In this section, we identify a subset of a coordinator's private information that is sufficient for decision-making in the game of coordinators formulated in Section III. We refer to this subset of private information as the Sufficient Private Information (SPI) for this coordinator.
We restrict attention to Sufficient Private Information Based (SPIB) strategies, where coordinators choose prescriptions based on their sufficient private information along with the common information. As a result, the coordinators do not need full recall to play SPIB strategies. We show that there always exists a Coordinators' Nash Equilibrium where the coordinators play SPIB strategies; hence the restriction to SPIB strategies does not hurt the existence of equilibria. We proceed as follows. We first present a structural result that plays an important role in the subsequent analysis. We then introduce our results gradually in separate subsections based on the value of d, the delay in information sharing within the same team. We treat the cases d = 1 and d > 1 separately, since when d = 1 the equilibrium strategies we obtain are simpler than those under d > 1. For d > 1, we introduce the notion of Partially Realized Prescriptions (PRPs) and use them to construct a subset of private information that is sufficient for decision-making. We then define the notions of Sufficient Private Information (SPI) and Sufficient Private Information Based (SPIB) strategies to unify the results for d = 1 and d > 1. Finally, we show that CNEs where coordinators play SPIB strategies always exist.

A. A Preliminary Result
We show that the states and prescriptions of different coordinators are conditionally independent given the common information.
Lemma 3 (Conditional Independence). Under any behavioral coordination strategy profile g and for each time t ∈ T, (X^k_{1:t}, Γ^k_{1:t})_{k∈I} are conditionally independent given the common information H_t. Furthermore, the conditional distribution of (X^k_{1:t}, Γ^k_{1:t}) depends on g only through g^k.

Proof. See Appendix C.

As a result of Lemma 3, coordinator i's estimate of the other coordinators' states and prescriptions is independent of her own strategy and private information. In other words, while coordinator i has access to both the common information and her private information, her belief on the other coordinators' private information (history of states and prescriptions) is based solely on the common information.

B. Result for d = 1

While coordinator i's private information consists of (X^i_{1:t−1}, Γ^i_{1:t−1}), she does not have to use all of it to form a best response.

Lemma 4.
Under d = 1, for any behavioral coordination strategy profile g^{−i} of all coordinators other than i, there exists a best-response behavioral coordination strategy g^i for coordinator i that chooses randomized prescriptions based solely on (H_t, X^i_{t−1}).

Proof. Deferred to the proof of Lemma 6.

Lemma 4 shows that the coordinators can ignore much of their private information without compromising their objective.
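For d = 1, the compression in Lemma 4 amounts to pushing the most recently known state through the system's transition kernel. A minimal numerical sketch (the two-state kernel below is our own toy example, not from the paper):

```python
def belief_d1(x_prev, u, kernel):
    """Return P(X_t = . | X_{t-1} = x_prev, U_{t-1} = u): for delay d = 1,
    the coordinator's belief on her hidden state needs nothing beyond the
    last known state, the last action profile, and the transition kernel."""
    return dict(kernel[(x_prev, u)])

# Toy kernel P(x_next | x_prev, u) on a two-state, two-action system.
kernel = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.2, 1: 0.8},
    (1, 1): {0: 0.5, 1: 0.5},
}
b = belief_d1(0, 0, kernel)
assert b == {0: 0.9, 1: 0.1}
```

The point of the sketch is that no past prescription enters the computation; that is exactly what fails for d > 1, as discussed next.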
C. Result for d > 1

As for d = 1, we now identify a subset of private information for the d > 1 case that is sufficient for decision-making. Recall that coordinator i's information at time t consists of H^i_t = (Y_{1:t−1}, U_{1:t−1}, X^i_{1:t−d}). To choose her prescriptions at time t, coordinator i needs to estimate her hidden information (i.e., X^i_{t−d+1:t}). When d = 1, the belief on hidden information is constructed simply from (X^i_{t−1}, U_{t−1}) and the knowledge of the transition probabilities of the underlying system. However, when d > 1, more information beyond (X^i_{t−d}, U_{t−d:t−1}) is needed to form the belief. To illustrate this, we start with the case d = 2. When d = 2, the belief of coordinator i on her hidden information depends on the last prescription Γ^i_{t−1} in addition to (X^i_{t−2}, U_{t−2:t−1}). This is due to the signaling effect of the action U^i_{t−1}: since coordinator i knows U^i_{t−1}, she can infer something about X^i_{t−1} through the prescription used to produce these actions (recall that U^{i,j}_{t−1} = Γ^{i,j}_{t−1}(X^{i,j}_{t−2:t−1}) for (i,j) ∈ N^i). Hence at time t, coordinator i needs to take Γ^i_{t−1} into account when making a decision. Furthermore, for d = 2, when making a decision at time t, coordinator i can use a compressed version of the prescription Γ^i_{t−1} instead of Γ^i_{t−1} itself. This is because at time t, coordinator i has learned X^i_{t−2}, which she did not know at time t − 1. The coordinator can then focus on the following essential question: what is the relationship between X^i_{t−1} and U^i_{t−1}, given the knowledge of X^i_{t−2}? Similarly, for a general d > 1, to estimate the hidden information, each coordinator needs to utilize her past (d − 1) prescriptions. Again, a coordinator can use a compressed version of the past (d − 1) prescriptions, since she can incorporate the additional information she knows at time t that she did not know when the prescriptions were chosen.
Each coordinator can now focus on the relationship between the unknown states and the known actions, given what is already known. This motivates the definition of (d − 1)-step PRPs.

Definition 6.
The (d − 1)-step partially realized prescriptions (PRPs) for coordinator i at time t are a collection of functions Φ^i_t := (Φ^{i,j}_{t−l,l})_{(i,j)∈N^i, 1≤l≤d−1}, where Φ^{i,j}_{t−l,l} = Γ^{i,j}_{t−l}(X^{i,j}_{t−l−d+1:t−d}, ·) is a function from X^{i,j}_{t−d+1:t−l} to U^{i,j}_{t−l}.

PRPs have smaller dimension than prescriptions. To illustrate this point, consider the case d = 2: a prescription γ^{i,j}_t can be represented as a table, where the rows represent x^{i,j}_{t−1} ∈ X^{i,j}_{t−1}, the columns represent x^{i,j}_t ∈ X^{i,j}_t, and the entries represent the corresponding actions u^{i,j}_t = γ^{i,j}_t(x^{i,j}_{t−1:t}) to take. On the other hand, the 1-step partially realized prescription φ^{i,j}_{t,1} = γ^{i,j}_t(x^{i,j}_{t−1}, ·) can be represented by the one row of the table of γ^{i,j}_t selected by the realization of X^{i,j}_{t−1}. In addition to (X^i_{t−d}, U_{t−d:t−1}, Φ^i_t), coordinator i also needs to use Y^i_{t−d+1:t−1} to form a belief on her hidden information, since Y^i_{t−d+1:t−1} can provide additional insight into X^i_{t−d+1:t−1} that (X^i_{t−d}, U_{t−d:t−1}, Φ^i_t) cannot necessarily provide. The belief coordinator i has on her hidden information is summarized in the following lemma.

Lemma 5.
Suppose that the behavioral coordination strategy profile g = (g^i)_{i∈I} is being played. Then the conditional distribution of X^i_{t−d+1:t} given H^i_t under g can be expressed as a fixed function of (Y^i_{t−d+1:t−1}, U_{t−d:t−1}, X^i_{t−d}, Φ^i_t), i.e.,

P^g(x^i_{t−d+1:t} | h^i_t) = P^i_t(x^i_{t−d+1:t} | y^i_{t−d+1:t−1}, u_{t−d:t−1}, x^i_{t−d}, φ^i_t)  ∀ h^i_t ∈ H^i_t,  (1)

for some function P^i_t independent of g.

Proof. See Appendix D.
Remark 4.
The above result can be interpreted in the following way: X^i_{t−d} is perfectly observed, hence coordinator i can discard X^i_{1:t−d−1}, which is irrelevant information due to the Markov property. Since X^i_{t−d+1:t−1} is not perfectly observed by coordinator i, every public observation and action based upon X^i_{t−d+1:t−1} is important to coordinator i, since it can help in estimating the state X^i_{t−d+1:t−1}. Note that Φ^i_t encodes the essential information coordinator i needs to remember at time t about her previous signaling strategy: how does X^i_{t−d+1:t−1} (unknown) map to U^i_{t−d+1:t−1} (known)? With this piece of information, coordinator i can fully interpret the signals sent through U^i_{t−d+1:t−1}. We claim that while coordinator i's private information consists of (X^i_{1:t−d}, Γ^i_{1:t−1}), coordinator i only needs to use (X^i_{t−d}, Φ^i_t) along with the common information to choose prescriptions. The (d − 1)-step PRPs are the same as the partial functions defined in the second structural result of [33].

Lemma 6. Given an arbitrary d > 1, for any behavioral coordination strategy profile g^{−i} of all coordinators other than i, there exists a best-response behavioral coordination strategy g^i for coordinator i that chooses randomized prescriptions based solely on (H_t, X^i_{t−d}, Φ^i_t).

Proof. See Appendix E.
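The compression from prescriptions to PRPs can be made concrete for d = 2: a prescription is a full table over (x_{t−1}, x_t), while the 1-step PRP is the single row picked out by the realized x_{t−1}. A sketch with illustrative ±1-valued states and actions (our own encoding):

```python
def partially_realize(prescription, x_realized):
    """Fix the realized first argument of a d = 2 prescription, returning
    the 1-step PRP: one row of the prescription table."""
    return {x_t: prescription[(x_realized, x_t)] for x_t in (-1, 1)}

# Full prescription table: action for every (x_{t-1}, x_t) pair (4 entries).
gamma = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

# Once x_{t-1} = 1 is learned, only its row (2 entries) is relevant.
phi = partially_realize(gamma, x_realized=1)
assert phi == {-1: 1, 1: -1}
```

With |X| = n state values, the full table has n² entries while the PRP has n, which is the dimension reduction described above.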
Remark 5.
Lemmas 5 and 6 and their proofs also apply to d = 1, in which case the (d − 1)-step PRP Φ^i_t is empty by definition. From now on, we unify the results for d = 1 and d > 1. We formally define the Sufficient Private Information (SPI) and SPIB strategies, which will be used in the rest of the paper.

Definition 7 (Sufficient Private Information). For a given d ≥ 1, the Sufficient Private Information (SPI) for coordinator i at time t is defined as S^i_t = (X^i_{t−d}, Φ^i_t).

Definition 8 (Sufficient Private Information Based Strategy). A Sufficient Private Information Based (SPIB) strategy for coordinator i is a collection of functions ρ^i = (ρ^i_t)_{t∈T}, ρ^i_t : H_t × S^i_t → ∆(Γ^i_t).

It can be easily verified that S^i_t can be updated sequentially, i.e., there exists a fixed function ι^i_t such that

S^i_{t+1} = ι^i_t(S^i_t, X^i_{t−d+1}, Γ^i_t).  (2)

Therefore, a coordinator does not need full recall to play an SPIB strategy.

D. Coordinators' Nash Equilibrium in SPIB Strategies and its Existence
Since the coordinators have perfect recall, we know from standard results for dynamic games that a CNE, as defined in Definition 5, exists (see, for example, Chapter 11 of [66]). However, in those CNEs, coordinators do not necessarily play SPIB strategies, and hence the standard arguments that guarantee the existence of a CNE cannot be used to establish the existence of a CNE in SPIB strategies. Moreover, SPIB strategies do not feature full recall, so one cannot directly apply standard arguments to obtain the existence of a CNE in SPIB strategies. An SPIB strategy profile ρ = (ρ^i_t)_{i∈I,t∈T}, ρ^i_t : H_t × S^i_t → ∆(Γ^i_t), is called a Sufficient Private Information Based Coordinators' Nash Equilibrium (SPIB-CNE) if ρ, seen as a profile of behavioral coordination strategies, forms a Coordinators' Nash Equilibrium.

Theorem 1.
There exists at least one SPIB-CNE for the dynamic game among coordinators.

Proof.
See Appendix F.

V. COMPRESSION OF COMMON INFORMATION AND SEQUENTIAL DECOMPOSITION

The SPIB strategies defined in the previous section use sufficient private information instead of the entire private information of each coordinator. If the sets X_t, Y_t, U_t are time-invariant, the set of possible values of the sufficient private information used in SPIB strategies is also time-invariant. However, the common information still grows with time, and this means that the domain of SPIB strategies keeps increasing with time. In order to limit the growing domain of SPIB strategies, we introduce a subclass of SPIB strategies, named Compressed Information Based (CIB) strategies, where the coordinators use a compressed version of the common information instead of the entire common information. We show that this new class of strategies satisfies a key best-response/closedness property. Based on this property, we provide a backward inductive procedure that identifies an equilibrium in this subclass of strategies, provided that each step of the procedure has a solution. While equilibria in CIB strategies may not exist in general (see the example in Section V-E), we identify classes of games among teams where such equilibria do exist.

A. Compressed Common Information and CIB Strategies
In decentralized control problems [34], [36] and games among individuals [55], [56], agents can compress their common information into beliefs on hidden and (sufficient) private information for the purpose of decision-making. Similarly, we would like to consider a subclass of SPIB strategies where each coordinator compresses the common information H_t into a belief on sufficient private information and hidden information, i.e., P(X^k_{t−d:t} = ·, Φ^k_t = · | H_t) for k ∈ I. Due to Lemma 5, these beliefs can be constructed from P(X^k_{t−d} = ·, Φ^k_t = · | H_t) and (Y^k_{t−d+1:t−1}, U_{t−d:t−1}). Therefore, we will consider strategies where coordinators use common information based beliefs on the sufficient private information S^k_t = (X^k_{t−d}, Φ^k_t), k ∈ I, along with the uncompressed values of (Y_{t−d+1:t−1}, U_{t−d:t−1}), instead of the whole H_t. We formalize the above discussion in the rest of this subsection.

Definition 9 (Belief Generation System). A Belief Generation System for coordinator i consists of a sequence of functions ψ^i = (ψ^{i,k}_t)_{k∈I,t∈T} where

ψ^{i,k}_t : (∏_{l∈I} ∆(S^l_t)) × Y_{t−d+1:t} × U_{t−d:t} → ∆(S^k_{t+1}).

Coordinator i can use this system to generate common information based beliefs Π^{i,k}_t ∈ ∆(S^k_t) for all k ∈ I as follows:
• Π^{i,k}_1 is the prior distribution of (X^k_{−(d−1)}, Φ^k_1), i.e., a measure which assigns probability 1 to the event (X^k_{−(d−1)} = 0, Φ^k_1 = φ̂^k), where φ̂^k is the PRP that always produces actions u^{k,j}_t = 0 for all (k,j) ∈ N^k, t ≤ 0 (see Remark 2);
• Π^{i,k}_{t+1} = ψ^{i,k}_t((Π^{i,l}_t)_{l∈I}, Y_{t−d+1:t}, U_{t−d:t}), t ≥ 1.

Π^{i,k}_t represents coordinator i's subjective belief on coordinator k's sufficient private information S^k_t. These beliefs, along with (Y_{t−d+1:t−1}, U_{t−d:t−1}), will serve as coordinator i's compressed common information.

Definition 10 (Compressed Common Information).
We define coordinator i's Compressed Common Information (CCI) at time t as

B^i_t = ((Π^{i,l}_t)_{l∈I}, Y_{t−d+1:t−1}, U_{t−d:t−1}),

where (Π^{i,l}_t)_{l∈I} are generated using the belief generation system defined in Definition 9. Note that when d = 1, we have B^i_t = ((Π^{i,l}_t)_{l∈I}, U_{t−1}). We can write the belief update using B^i_t as Π^{i,k}_{t+1} = ψ^{i,k}_t(B^i_t, Y_t, U_t). With a slight abuse of notation, we use ψ^i_t to represent the collection (ψ^{i,k}_t)_{k∈I} and write the belief updates collectively as (Π^{i,l}_{t+1})_{l∈I} = ψ^i_t(B^i_t, Y_t, U_t). We now define a subclass of strategies where coordinator i uses her CCI instead of the entire common information.

Definition 11 (Compressed Information Based Strategy). Let B_t = (∏_{k∈I} ∆(S^k_t)) × Y_{t−d+1:t−1} × U_{t−d:t−1}. A Compressed Information Based (CIB) strategy for coordinator i is a pair (λ^i, ψ^i), where λ^i = (λ^i_t)_{t∈T} is a collection of functions λ^i_t : B_t × S^i_t → ∆(Γ^i_t), and ψ^i = (ψ^{i,k}_t)_{k∈I,t∈T}, ψ^{i,k}_t : B_t × Y_t × U_t → ∆(S^k_{t+1}), is a belief generation system as defined in Definition 9.

Under a CIB strategy, coordinator i uses her belief generation system to compress the common information into beliefs and then uses these beliefs along with (Y_{t−d+1:t−1}, U_{t−d:t−1}, S^i_t) to select a randomized prescription. Thus, a CIB strategy (λ^i, ψ^i) is equivalent to an SPIB strategy

ρ^i_t(h_t, s^i_t) = λ^i_t((π^{i,k}_t)_{k∈I}, y_{t−d+1:t−1}, u_{t−d:t−1}, s^i_t)  ∀ h_t ∈ H_t, ∀ s^i_t ∈ S^i_t,

where (π^{i,k}_t)_{k∈I} is generated from h_t through the belief generation system defined in Definition 9.

Remark 6.
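Structurally, a CIB strategy is a pair (λ, ψ) acting on a recursively updated belief vector. A minimal interface sketch (all names are our own; lam and psi are placeholders for the maps λ^i_t and ψ^i_t of Definition 11):

```python
class CIBStrategy:
    """Sketch of a CIB strategy: beliefs replace the growing common
    information, and only recent observations/actions are kept raw."""

    def __init__(self, lam, psi, initial_beliefs):
        self.lam = lam                  # (CCI, SPI) -> dist over prescriptions
        self.psi = psi                  # (CCI, y_t, u_t) -> next belief vector
        self.beliefs = initial_beliefs  # (Pi^{i,k}_t)_{k in I}

    def cci(self, recent_obs, recent_acts):
        # B^i_t: beliefs plus the uncompressed recent observations/actions
        return (tuple(self.beliefs), recent_obs, recent_acts)

    def act(self, recent_obs, recent_acts, spi):
        return self.lam(self.cci(recent_obs, recent_acts), spi)

    def update(self, recent_obs, recent_acts, y_t, u_t):
        self.beliefs = self.psi(self.cci(recent_obs, recent_acts), y_t, u_t)

# Toy instantiation: one team, a single prescription "g", trivial update.
s = CIBStrategy(lam=lambda cci, spi: {"g": 1.0},
                psi=lambda cci, y, u: cci[0],
                initial_beliefs=({"s0": 1.0},))
assert s.act((), (), spi=None) == {"g": 1.0}
```

The key design point mirrors Remark 6: the state carried between stages (beliefs plus a sliding window of observations and actions) has fixed size, unlike the full history H_t.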
One advantage of CIB strategies is that at each time coordinator i only needs to use her current CCI rather than the time-increasing full common information (i.e., H_t). Thus, if the sets X_t, Y_t, U_t are time-invariant, the mappings λ^i_t, ψ^i_t in a CIB strategy have a time-invariant domain.

Remark 7.
We have not imposed any restriction on the mapping ψ^i_t in coordinator i's belief generation system (see Definition 9). Intuitively, however, one can imagine that coordinator i has some prediction about the other coordinators' strategies and is rationally using this prediction to update her beliefs through the mapping ψ^i_t. In the following discussion, our focus will be on such "rational" ψ^i_t, where the notion of rationality will be captured by Bayes' rule. Coordinator i's beliefs generated from ψ^i can be grouped into two parts: (Π^{i,−i}_t)_{t∈T} and (Π^{i,i}_t)_{t∈T}. The first part represents what coordinator i believes about the other coordinators' SPI. The second part represents what coordinator i thinks the other coordinators believe about her own SPI.

B. Consistency and Closedness of CIB Strategies
As mentioned before, our interest in CIB strategies is motivated by the common information belief based strategies that appear in the solution of decentralized control problems [34], [36] and games among individuals [54], [55]. The common beliefs used in these prior works are compatible with Bayes' rule (i.e., the beliefs can be obtained using Bayes' rule along with knowledge of the system model and the strategies being used). Inspired by these observations, we are particularly interested in CIB strategies where the belief generation system is compatible with Bayes' rule, i.e., the beliefs generated by coordinator i using ψ^i agree with those generated using Bayes' rule along with knowledge of the system model and the strategies being used. In the following discussion, we identify a key property of such Bayes'-rule-compatible CIB strategies. To do so, we use the following technical definition.

Definition 12 (Consistency). Given λ^i_t : B_t × S^i_t → ∆(Γ^i_t), a belief generation function ψ^{∗,i}_t : B_t × Y_t × U_t → ∆(S^i_{t+1}) is said to be consistent with λ^i_t if the following holds: for all b_t = ((π^l_t)_{l∈I}, y_{t−d+1:t−1}, u_{t−d:t−1}) ∈ B_t, ψ^{∗,i}_t(b_t, y_t, u_t) is equal to the conditional distribution of S^i_{t+1} given the event (Y_t = y_t, U_t = u_t) found using Bayes' rule (whenever Bayes' rule applies), assuming that y_{t−d+1:t−1} and u_{t−d:t−1} are the realizations of recent observations and actions, S^i_t has prior distribution π^i_t, and, given S^i_t = s^i_t, Γ^i_t has distribution λ^i_t(b_t, s^i_t).
That is,

[ψ^{∗,i}_t(b_t, y_t, u_t)](s^i_{t+1}) = Υ^i_t(b_t, y^i_t, u_t, s^i_{t+1}) / Σ_{s̃^i_{t+1}} Υ^i_t(b_t, y^i_t, u_t, s̃^i_{t+1})  (3)

whenever the denominator of (3) is non-zero, where

Υ^i_t(b_t, y^i_t, u_t, s^i_{t+1}) := Σ_{s̃^i_t} Σ_{x̃^i_{t−d+1:t}} Σ_{γ̃^i_t : γ̃^i_t(x̃^i_{t−d+1:t}) = u^i_t} [ P(y^i_t | x̃^i_t, u_t) × 1{s^i_{t+1} = ι^i_t(s̃^i_t, x̃^i_{t−d+1}, γ̃^i_t)} λ^i_t(γ̃^i_t | b_t, s̃^i_t) × P^i_t(x̃^i_{t−d+1:t} | y^i_{t−d+1:t−1}, u_{t−d:t−1}, s̃^i_t) π^i_t(s̃^i_t) ]

for all b_t = ((π^l_t)_{l∈I}, y_{t−d+1:t−1}, u_{t−d:t−1}) ∈ B_t, y^i_t ∈ Y^i_t, u_t ∈ U_t, s^i_{t+1} ∈ S^i_{t+1}; here ι^i_t is defined in (2) and P^i_t is as described in Lemma 5.

For any index set Ω ⊂ I × T, we say that ψ^{∗,i} = (ψ^{∗,i}_t)_{(i,t)∈Ω} is consistent with λ^i = (λ^i_t)_{(i,t)∈Ω} if ψ^{∗,i}_t is consistent with λ^i_t for all (i,t) ∈ Ω. A CIB strategy (λ^i, ψ^i) for coordinator i is said to be self-consistent if ψ^{i,i} is consistent with λ^i. Since self-consistency can be viewed as Bayes'-rule compatibility, the beliefs (Π^{i,i}_t)_{t∈T} represent true conditional distributions of coordinator i's SPI given the common information under a self-consistent strategy.

Lemma 7.
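The normalization in (3) is an ordinary Bayes update: compute the unnormalized weights Υ and divide by their sum whenever the sum is nonzero. A generic sketch (the numeric weights below are arbitrary stand-ins for the Υ^i_t terms, not values derived from any model):

```python
def bayes_update(weights):
    """Normalize unnormalized posterior weights over next-stage SPI values.
    Returns None when the normalizer is zero, i.e., when Bayes' rule does
    not apply (the caveat following (3))."""
    total = sum(weights.values())
    if total == 0:
        return None
    return {s: w / total for s, w in weights.items()}

post = bayes_update({"s0": 3.0, "s1": 1.0})
assert post == {"s0": 0.75, "s1": 0.25}
assert bayes_update({"s0": 0.0, "s1": 0.0}) is None
```

Consistency then requires that ψ^{∗,i}_t reproduce exactly this normalized update of the weights induced by λ^i_t, at every CCI realization where the normalizer is positive.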
Let (λ^i, ψ^i) be a self-consistent CIB strategy of coordinator i. Denote the behavioral strategy generated from (λ^i, ψ^i) by g^i. Let h_t ∈ H_t be admissible under g^i_{1:t−1}; then

P^{g^i_{1:t−1}}(s^i_t, x^i_{t−d+1:t} | h_t) = π^{i,i}_t(s^i_t) P^i_t(x^i_{t−d+1:t} | y^i_{t−d+1:t−1}, u_{t−d:t−1}, s^i_t)  ∀ s^i_t ∈ S^i_t, ∀ x^i_{t−d+1:t} ∈ X^i_{t−d+1:t},

where π^{i,i}_t is the belief obtained using ψ^i under the realization h_t of the common information and P^i_t is as described in Lemma 5.

Proof. See Appendix G.

Now, consider a game with two coordinators. Suppose that coordinator 1 plays a self-consistent CIB strategy with belief generation system ψ^1. Since the belief Π^{1,1}_t generated from ψ^1 is a true conditional distribution of coordinator 1's SPI, coordinator 2 can use Π^{1,1}_t as her belief on coordinator 1's SPI. Further, coordinator 2 can use ψ^1 to compute coordinator 1's belief about coordinator 2's SPI. This suggests that coordinator 2 should mimic coordinator 1's belief generation system when coordinator 1's strategy is self-consistent. This observation, along with results from Markov decision theory, leads to the following crucial best-response property of CIB strategies.

Lemma 8 (Closedness of CIB strategies). Suppose that all coordinators other than coordinator i are using self-consistent CIB strategies. Let (λ^k, ψ^k)_{k∈I\{i}} be the CIB strategy profile of the coordinators other than i, and suppose that ψ^j = ψ^k for all j, k ∈ I\{i}. Then a best-response strategy for coordinator i is a CIB strategy with the same belief generation system as the other coordinators.

Proof. See Appendix H.
C. Interpretation and Discussion of the Consistency and Closedness Properties
Lemma 8 imposes two conditions on the CIB strategies of the coordinators other than i, namely (I) they are self-consistent, and (II) their belief generation systems are identical. In order to illustrate the significance of both conditions, we first describe how coordinator i could form her best response when all coordinators other than i are playing some generic CIB strategies that are not necessarily self-consistent or that may have different belief generation systems. The problem of finding coordinator i's best response to the others' CIB strategies can be thought of as a stochastic control problem with partial observation. This suggests that in order to form a best response at time t, coordinator i needs to compute (or form beliefs on) the data that coordinators −i's CIB strategies use, i.e., the CCI and the SPI of the other coordinators. Coordinator i also needs to estimate all the hidden information in order to evaluate the payoffs. Coordinator i's estimation task can be divided into three sub-tasks: (i) forming a belief on her own hidden information X^i_{t−d+1:t}; (ii) recovering coordinators −i's CCI (B^k_t)_{k∈I\{i}}; and (iii) forming a belief on coordinators −i's SPI and hidden information X^{−i}_{t−d+1:t}. For the first sub-task, coordinator i can compute the belief using (Y^i_{t−d+1:t−1}, U_{t−d:t−1}, S^i_t) through the function P^i_t defined in Lemma 5, without using any belief generation system. For the second sub-task, recall that B^k_t includes (Y_{t−d+1:t−1}, U_{t−d:t−1}), which coordinator i already knows; thus, coordinator i can simply use (ψ^k)_{k∈I\{i}} and the common information H_t to compute all the beliefs in (B^k_t)_{k∈I\{i}}. Condition (I), namely that the CIB strategies of the coordinators other than i are self-consistent, ensures, by Lemma 7, that coordinator i can also accomplish the third sub-task using the beliefs in (B^k_t)_{k∈I\{i}}.
By using self-consistent CIB strategies, coordinators −i effectively "invite" coordinator i to use the same belief generation system as −i. Thus, all of coordinator i's sub-tasks can be accomplished if she keeps track of her own S^i_t and the CCI (B^k_t)_{k∈I\{i}} used by the others. Therefore, coordinator i can form a best response with a strategy that chooses prescriptions based on (B^k_t)_{k∈I\{i}} and S^i_t at time t. Condition (II), namely that the belief generation systems are identical, ensures that the B^k_t are identical for all k ∈ I\{i}, and hence the best response described above becomes a CIB strategy with the same belief generation system as the one used by all coordinators other than i.

Remark 8.
Note that the CIB strategy that is a best response for coordinator i in Lemma 8 may not necessarily be self-consistent. However, the equilibrium strategies in a CIB-CNE (which we will introduce later) will be self-consistent for all players.

D. Coordinators' Nash Equilibrium in CIB Strategies and Sequential Decomposition
The fact that one of coordinator i's best responses to the others using CIB strategies (with identical and self-consistent belief generation systems) is itself a CIB strategy (with the same belief generation system as the others) suggests the possibility of a Coordinators' Nash Equilibrium (CNE) where all coordinators use CIB strategies with identical and self-consistent belief generation systems. We refer to such a CNE as a CIB-CNE. More formally, a CIB-CNE is a CIB strategy profile (λ^{∗i}, ψ^i)_{i∈I} where (i) all coordinators have the same belief generation system, i.e., for all i ∈ I, ψ^i = ψ^∗ for some ψ^∗; (ii) for each k ∈ I, ψ^{∗,k} is consistent with λ^{∗k}; and (iii) for each i ∈ I, the CIB strategy (λ^{∗i}, ψ^i) is a best response for coordinator i to (λ^{∗k}, ψ^k)_{k∈I\{i}}. Notice that in a CIB-CNE all coordinators use the same belief generation system, hence the CCI B^i_t (as defined in Definition 10) is the same for all coordinators; we denote this common value by B_t. Furthermore, when all coordinators other than i use fixed CIB strategies, (B_t, S^i_t) can be viewed as an information state for coordinator i's stochastic control problem (see the proof of Lemma 8 for details). Based on this observation, we introduce a backward inductive computation procedure for determining CIB-CNEs in which B_t is used as an information state. Our procedure decomposes the game into a collection of one-stage games, one for each time t and each realization of B_t. These one-stage games are used to characterize a CIB-CNE in a backward inductive manner.

Definition 13 (Stage Game).
Given the value functions V_{t+1} = (V^i_{t+1})_{i∈I}, where V^i_{t+1} : B_{t+1} × S^i_{t+1} → R, a realization of the CCI b_t = (π_t, y_{t−d+1:t−1}, u_{t−d:t−1}), where π_t = (π^i_t)_{i∈I}, π^i_t ∈ ∆(S^i_t), and update functions ψ^∗_t = (ψ^{∗,i}_t)_{i∈I}, ψ^{∗,i}_t : B_t × Y_t × U_t → ∆(S^i_{t+1}), we define a stage game for the coordinators' dynamic game as follows:

Stage Game G_t(V_{t+1}, b_t, ψ^∗_t):
• There are |I| players, each representing a coordinator.
• (V_{t+1}, b_t, ψ^∗_t) are commonly known. Nature chooses Z_t = (S_t, X_{t−d+1:t}, W^Y_t), where S_t = (S^k_t)_{k∈I}.
• Player i observes S^i_t = s^i_t.
• Player i's belief on Z_t is given by

β^i_t(z̃_t | s^i_t) = 1{s̃^i_t = s^i_t} ∏_{k≠i} π^k_t(s̃^k_t) × ∏_{k∈I} P^k_t(x̃^k_{t−d+1:t} | y^k_{t−d+1:t−1}, u_{t−d:t−1}, s̃^k_t) P(w̃^{k,Y}_t), ∀ z̃_t = (s̃_t, x̃_{t−d+1:t}, w̃^Y_t) ∈ S_t × X_{t−d+1:t} × W^Y_t,  (4)

where P^k_t is the belief function defined in Eq. (1).
• Player i selects a prescription Γ^i_t ∈ Γ^i_t as her action.
• Player i has utility

Q^i_t(Z_t, Γ_t) = r^i_t(X_t, U_t) + V^i_{t+1}(B_{t+1}, S^i_{t+1}),  (5)

where

U^{k,j}_t = Γ^{k,j}_t(X^{k,j}_{t−d+1:t}) ∀ (k,j) ∈ N,
B_{t+1} = ((Π^k_{t+1})_{k∈I}, (y_{t−d+2:t−1}, Y_t), (u_{t−d+1:t−1}, U_t)),
Π^k_{t+1} = ψ^{∗,k}_t(b_t, Y_t, U_t) ∀ k ∈ I,
Y^k_t = ℓ^k_t(X^k_t, U_t, W^{k,Y}_t) ∀ k ∈ I,
S^i_{t+1} = ι^i_t(S^i_t, X^i_{t−d+1}, Γ^i_t).

Given the stage game G_t(V_{t+1}, b_t, ψ^∗_t), we define two associated concepts:

Definition 14 (IBNE Correspondence). Given the value functions V_{t+1} = (V^i_{t+1})_{i∈I}, where V^i_{t+1} : B_{t+1} × S^i_{t+1} → R, and belief update functions ψ^∗_t = (ψ^{∗,i}_t)_{i∈I}, ψ^{∗,i}_t : B_t × Y_t × U_t → ∆(S^i_{t+1}), the Interim Bayesian Nash Equilibrium correspondence
$\mathrm{IBNE}_{t}(V_{t+1}, \psi^{*}_{t})$ is defined as the set of all $\lambda_{t} = (\lambda^{i}_{t})_{i \in \mathcal{I}}$, $\lambda^{i}_{t} : \mathcal{B}_{t} \times \mathcal{S}^{i}_{t} \to \Delta(\boldsymbol{\Gamma}^{i}_{t})$, such that
\[
\lambda^{i}_{t}(b_{t}, s^{i}_{t}) \in \operatorname*{arg\,max}_{\eta \in \Delta(\boldsymbol{\Gamma}^{i}_{t})} \sum_{\tilde z_{t}, \tilde\gamma_{t}} \eta(\tilde\gamma^{i}_{t})\, Q^{i}_{t}(\tilde z_{t}, \tilde\gamma_{t})\, \beta^{i}_{t}(\tilde z_{t} \mid s^{i}_{t}) \prod_{k \neq i} \lambda^{k}_{t}(\tilde\gamma^{k}_{t} \mid b_{t}, \tilde s^{k}_{t}) \quad \forall b_{t} \in \mathcal{B}_{t},\ s^{i}_{t} \in \mathcal{S}^{i}_{t},\ \forall i \in \mathcal{I},
\]
where $\beta^{i}_{t}$ and $Q^{i}_{t}$ are defined using $(V^{i}_{t+1}, b_{t}, \psi^{*}_{t})$ in (4) and (5), respectively.

Definition 15 (DP Operator). Given a value function $V^{i}_{t+1} : \mathcal{B}_{t+1} \times \mathcal{S}^{i}_{t+1} \to \mathbb{R}$ and a CIB strategy profile $(\lambda^{*}_{t}, \psi^{*}_{t})$ at time $t$, where $\lambda^{*}_{t} = (\lambda^{*i}_{t})_{i \in \mathcal{I}}$, $\lambda^{*i}_{t} : \mathcal{B}_{t} \times \mathcal{S}^{i}_{t} \to \Delta(\boldsymbol{\Gamma}^{i}_{t})$, and $\psi^{*}_{t} = (\psi^{*,i}_{t})_{i \in \mathcal{I}}$, $\psi^{*,i}_{t} : \mathcal{B}_{t} \times \mathcal{Y}_{t} \times \mathcal{U}_{t} \to \Delta(\mathcal{S}^{i}_{t+1})$, the dynamic programming operator $\mathrm{DP}^{i}_{t}$ defines the value function at time $t$ through
\[
[\mathrm{DP}^{i}_{t}(V^{i}_{t+1}, \lambda^{*}_{t}, \psi^{*}_{t})](b_{t}, s^{i}_{t}) := \sum_{\tilde z_{t}, \tilde\gamma_{t}} Q^{i}_{t}(\tilde z_{t}, \tilde\gamma_{t})\, \beta^{i}_{t}(\tilde z_{t} \mid s^{i}_{t}) \prod_{k \in \mathcal{I}} \lambda^{*k}_{t}(\tilde\gamma^{k}_{t} \mid b_{t}, \tilde s^{k}_{t}),
\]
where $\beta^{i}_{t}$ and $Q^{i}_{t}$ are defined using $(V^{i}_{t+1}, b_{t}, \psi^{*}_{t})$ in (4) and (5), respectively. (Since $\mathcal{X}_{t}$, $\mathcal{U}_{t}$, $\mathcal{Y}_{t}$ are finite sets, one can assume without loss of generality that $W^{Y}_{t}$ also takes finitely many values.)

Theorem 2 (Sequential Decomposition). Let $(\lambda^{*i}, \psi^{*})_{i \in \mathcal{I}}$ be a CIB strategy profile with an identical belief generation system $\psi^{*}$ for all $i \in \mathcal{I}$. If this strategy profile satisfies the dynamic program defined below: $V^{i}_{T+1}(\cdot, \cdot) = 0$ for all $i \in \mathcal{I}$; and for $t \in \mathcal{T}$,
\[
\lambda^{*}_{t} \in \mathrm{IBNE}_{t}(V_{t+1}, \psi^{*}_{t}); \tag{6}
\]
\[
\psi^{*}_{t} \text{ is consistent with } \lambda^{*}_{t}; \tag{7}
\]
\[
V^{i}_{t} := \mathrm{DP}^{i}_{t}(V^{i}_{t+1}, \lambda^{*}_{t}, \psi^{*}_{t}) \quad \forall i \in \mathcal{I},
\]
then $(\lambda^{*i}, \psi^{*})_{i \in \mathcal{I}}$ forms a CIB-CNE.

Proof. See Appendix I.
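The backward pass of Theorem 2 has a simple computational shape. The following Python sketch is purely illustrative: it assumes finite, explicitly enumerated sets of CCI and SPI realizations, and the helper `stage_solver` (our name, not the paper's) stands in for solving (6)-(7) at a fixed realization $b_t$; such a stage solution need not exist, which is exactly the existence issue discussed later in this section.

```python
def solve_cib_dp(T, ccis, spis, players, stage_solver):
    """Backward-inductive shape of Theorem 2 (illustrative only).

    stage_solver(t, V, b) must return (lam_b, psi_b, values), where
    lam_b / psi_b solve (6)-(7) at realization b and values[i][s] is
    player i's stage value at (b, s).  Such a solver may fail to exist.
    """
    # Terminal condition: V^i_{T+1} = 0.
    V = {i: {(b, s): 0.0 for b in ccis for s in spis} for i in players}
    policy = {}
    for t in range(T, 0, -1):
        new_V = {i: {} for i in players}
        for b in ccis:  # Remark 9: (6)-(7) are checked per realization b_t
            lam_b, psi_b, values = stage_solver(t, V, b)
            policy[(t, b)] = (lam_b, psi_b)
            for i in players:
                for s in spis:
                    new_V[i][(b, s)] = values[i][s]
        V = new_V
    return policy, V

# Degenerate sanity check: one player, one CCI/SPI realization, and a
# stage "solver" that just adds a unit reward per stage.
toy = lambda t, V, b: (None, None, {"i": {"s0": 1.0 + V["i"][(b, "s0")]}})
policy, V = solve_cib_dp(3, ["b0"], ["s0"], ["i"], toy)
print(V["i"][("b0", "s0")])  # 3.0
```

The point of the sketch is only the control flow: values are propagated backward one stage at a time, and the stage fixed-point problem is solved separately for each realization of the compressed common information.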
Remark 9.
Note that (6) and (7) can be verified for each realization $b_{t} \in \mathcal{B}_{t}$ separately; i.e., one can check that $\lambda^{*}_{t}(b_{t}, \cdot)$ is an IBNE of the stage game $G_{t}(V_{t+1}, b_{t}, \psi^{*}_{t}(b_{t}, \cdot))$, and that $\psi^{*}_{t}(b_{t}, \cdot)$ is consistent with $\lambda^{*}_{t}(b_{t}, \cdot)$, for each $b_{t}$.

E. Existence of CIB-CNE
We have shown in Theorem 1 that an SPIB-CNE always exists. However, a CIB-CNE does not necessarily exist, even when each team contains only one member (i.e., in games among individuals). We present below an example where no CIB-CNE exists.
Example 3.
Consider a 3-stage dynamic game (i.e., $\mathcal{T} = \{1, 2, 3\}$) with two players: Alice (A) and Bob (B). Each player forms a one-person team. Let $X^{A}_{t} \in \{-1, +1\}$ and $X^{B}_{t} \equiv \emptyset$, i.e., Bob is not associated with a state. Let $Y_{t} = \emptyset$, i.e., there is no public observation of the states. The initial state $X^{A}_{1}$ is uniformly distributed on $\{-1, +1\}$.

At $t = 1$: (a) Alice can choose an action $U^{A}_{1} \in \{-1, +1\}$ and Bob has no action to take; (b) the next state is given by $X^{A}_{2} = X^{A}_{1} \cdot U^{A}_{1}$; (c) the instantaneous reward is given by $r^{A}_{1}(X_{1}, U_{1}) = -r^{B}_{1}(X_{1}, U_{1}) = \varepsilon \cdot \mathbf{1}\{U^{A}_{1} = +1\}$, where $\varepsilon \in (0, 1)$.

At $t = 2$: (a) neither player has any action to take; (b) the state at the next time is given by $X^{A}_{3} = X^{A}_{2}$; (c) the instantaneous rewards are 0 for both players. (This stage is a dummy stage inserted into the game to alter the definition of the CCI at the beginning of the last stage.)

At $t = 3$: (a) Alice has no action to take, and Bob chooses $U^{B}_{3} \in \{L, R\}$; (b) the instantaneous reward $r^{A}_{3}(X_{3}, U_{3})$ for Alice is given by
\[
r^{A}_{3}(-1, L) = 0, \quad r^{A}_{3}(-1, R) = 1, \quad r^{A}_{3}(+1, L) = 2, \quad r^{A}_{3}(+1, R) = 0,
\]
and $r^{B}_{3}(X_{3}, U_{3}) = -r^{A}_{3}(X_{3}, U_{3})$.

In a game where each team contains only one person, we can take the delay $d$ to be any number (see Remark 1). In the next proposition, we view Example 3 as a game among teams with internal delay $d = 1$.

Proposition 1. There exists no CIB-CNE in the game described in Example 3.

Proof.
See Appendix J.
Remark 10.
One can obtain an example of non-existence of CIB-CNE for any $d > 1$ by inserting $d - 1$ additional dummy stages (analogous to stage 2) into Example 3 and viewing it as a game among teams with internal delay $d$. Example 3 can also be used to show that the CIB-PBE concept defined in [55] for games among individuals does not exist in general; hence the conjecture in [55] that a CIB-PBE always exists is not true.

Intuitively, the reason a CIB-CNE does not exist in this game is that at $t = 3$ a CIB strategy requires Bob to choose his action based only on a compressed version of his information rather than the full information. This compression does not hurt Bob's ability to form a best response. However, in an equilibrium Bob needs to choose carefully from the set of optimal responses in order to induce Alice to play the predicted mixed strategy. Being unable to choose different actions under different histories, due to information compression, Bob cannot sustain an equilibrium. In this game, as in the example in [49], payoff-irrelevant information plays an essential role in sustaining the equilibrium.

In the remainder of this section we present two subclasses of the dynamic games described in Section II where CIB-CNEs exist.
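To make the intuition concrete, consider the last-stage interaction of Example 3 in isolation: Alice effectively chooses the probability $p$ that $X^{A}_{3} = +1$ (via her choice at $t = 1$), Bob chooses the probability $q$ of playing $L$, and the stage payoffs are zero-sum. The sketch below is a numerical illustration we added, not part of the paper's argument; it verifies the unique mixed equilibrium of this auxiliary $2 \times 2$ game by the usual indifference conditions.

```python
# Alice's stage-3 payoffs in Example 3 (Bob, the minimizer, gets the negative).
# Keys: (realized state X_3, Bob's action).
R = {(-1, "L"): 0.0, (-1, "R"): 1.0, (+1, "L"): 2.0, (+1, "R"): 0.0}

def alice_payoff(p, q):
    """Expected payoff to Alice when P(X_3 = +1) = p and Bob plays L w.p. q."""
    return (p * (q * R[(+1, "L")] + (1 - q) * R[(+1, "R")])
            + (1 - p) * (q * R[(-1, "L")] + (1 - q) * R[(-1, "R")]))

# Indifference pins down the unique mixed equilibrium of this 2x2 game:
# Bob mixes only if 2p = 1 - p, and Alice mixes only if 2q = 1 - q.
p_star = q_star = 1.0 / 3.0
value = alice_payoff(p_star, q_star)

# At q*, Alice is indifferent over p, so she is willing to randomize.
assert abs(alice_payoff(0.0, q_star) - value) < 1e-12
assert abs(alice_payoff(1.0, q_star) - value) < 1e-12
print(value)  # 0.666... (= 2/3)
```

Sustaining $p^{*} = 1/3$ requires Alice to mix; with the extra $\varepsilon$ reward for $U^{A}_{1} = +1$, she is willing to mix only if Bob's response varies with the observed action history, which a CIB strategy at $t = 3$ cannot do. (This is a rough illustration of the obstruction; the full non-existence argument is in Appendix J.)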
1) Signaling-Neutral Teams:
In this subsection we consider $d = 1$. One subclass of games where CIB-CNEs exist is that in which the teams are signaling-neutral. In these games, the agents are indifferent with respect to signaling to other teams; i.e., revealing more or less information about their private information to the other teams does not affect their utility. (Note that agents can always actively reveal information to their teammates through their actions.) We shall now describe this class of games.

Definition 16.
A team $i$ whose state $X^{i}_{t}$ can be recovered from $(Y^{i}_{t}, U_{t})$ (i.e., for every fixed $u_{t}$, $\ell^{i}_{t}(x^{i}_{t}, u_{t}, W^{i,Y}_{t})$ has disjoint support for different $x^{i}_{t} \in \mathcal{X}^{i}_{t}$) is called a public team. Otherwise, it is called a private team. For a public team $i$, the private state $X^{i}_{t-1}$ is effectively part of the common information of all members of all teams.

Definition 17 (Information Dependency Graph). The information dependency graph $\mathcal{G}$ of a dynamic game is a directed graph defined as follows: the vertices represent the teams, and a directed edge $i \leftarrow j$ is present if either the state transition, the observation, or the instantaneous reward of team $i$ at some time $t$ depends directly on either the state or the action of team $j$. In other words, there is no directed edge from $j$ to $i$ if and only if $X^{i}_{t+1} = f^{i}_{t}(X^{i}_{t}, U^{-j}_{t}, W^{i,X}_{t})$, $Y^{i}_{t} = \ell^{i}_{t}(X^{i}_{t}, U^{-j}_{t}, W^{i,Y}_{t})$, and $r^{i}_{t}(X_{t}, U_{t}) = r^{i}_{t}(X^{-j}_{t}, U^{-j}_{t})$ for some functions $f^{i}_{t}, \ell^{i}_{t}, r^{i}_{t}$ for all $t$. Self-loops are not considered in this graph.

Theorem 3.
Let $d = 1$. If every strongly connected component of the information dependency graph $\mathcal{G}$ of a dynamic game consists of either (I) a single team, or (II) multiple public teams, then a CIB-CNE exists.

Proof. See Appendix K.
Remark 11.
The precedence relation among teams considered in Theorem 3 is similar to the $s$-partition of teams that was presented and analyzed in [67]. When the condition of Theorem 3 is satisfied, all teams are neutral in signaling: when a private team $i$ sends information, this information is only useful to those teams whose actions do not affect team $i$'s utility. Public teams are always neutral in signaling, since their state history is publicly available.

Notice that in Example 3, Alice (as a one-person team) is a private team while Bob is a public team. The instantaneous reward of Bob at $t = 3$ depends on Alice's state $X^{A}_{3}$, while Alice's instantaneous reward at $t = 3$ depends on Bob's action. Hence Alice and Bob form a strongly connected component in the information dependency graph.
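The condition of Theorem 3 can be checked mechanically once the information dependency graph is written down. The sketch below is our own illustration (the graph encoding and the `is_public` labels are assumptions, not notation from the paper): it computes strongly connected components via Kosaraju's algorithm and tests the single-team-or-all-public condition, using Example 3's two-team graph as input.

```python
def sccs(vertices, edges):
    """Strongly connected components of a directed graph (Kosaraju).
    edges: set of (u, v) pairs meaning there is a directed edge u -> v."""
    adj = {v: [] for v in vertices}
    radj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    order, seen = [], set()
    def dfs1(u):                      # first pass: record finish order
        seen.add(u)
        for w in adj[u]:
            if w not in seen:
                dfs1(w)
        order.append(u)
    for v in vertices:
        if v not in seen:
            dfs1(v)

    comp = {}
    def dfs2(u, label):               # second pass: on the reverse graph
        comp[u] = label
        for w in radj[u]:
            if w not in comp:
                dfs2(w, label)
    for v in reversed(order):
        if v not in comp:
            dfs2(v, v)

    groups = {}
    for v, label in comp.items():
        groups.setdefault(label, set()).add(v)
    return list(groups.values())

def satisfies_theorem3(vertices, edges, is_public):
    """Every SCC is a single team or consists of public teams only."""
    return all(len(c) == 1 or all(is_public[v] for v in c)
               for c in sccs(vertices, edges))

# Example 3: Bob's reward depends on Alice's state and Alice's reward on
# Bob's action, so A and B form one SCC containing a private team.
print(satisfies_theorem3(["A", "B"],
                         {("A", "B"), ("B", "A")},
                         {"A": False, "B": True}))  # False
```

As expected, Example 3 violates the condition, consistent with Proposition 1; breaking either dependency edge makes every component a singleton and the check passes.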
2) Signaling-Free Equilibria:
In this subsection, we introduce another class of games where a CIB-CNE exists. These games are the games-among-teams extension of Game M defined in [55]. We present the result for a general $d > 0$.

Example 4.
Consider a dynamic game that satisfies the following conditions:
• States are uncontrolled, i.e., $X^{i}_{t+1} = f^{i}_{t}(X^{i}_{t}, W^{i,X}_{t})$.
• Observations are uncontrolled, i.e., $Y^{i}_{t} = \ell^{i}_{t}(X_{t}, W^{i,Y}_{t})$.
• The instantaneous reward of team $i$ can be expressed as $r^{i}_{t}(X^{-i}_{t}, U_{t})$.

Theorem 4.
A dynamic game that satisfies the above conditions has a CIB-CNE.

Proof.
See Appendix L for a direct proof. Alternatively, one can first assume that the teams share information with a delay of $d = 0$; then a team can be viewed as a single individual, since all team members have the same information. One can then apply the results for Game M in [55] to obtain an equilibrium where each player/team plays a public strategy (i.e., a strategy that does not use private information), in particular, a strategy where actions are based solely on the common information based belief. Since public strategies can also be played when $d > 0$, we conclude that the equilibrium obtained is also an equilibrium of the original game.

VI. ADDITIONAL RESULTS

A. Separated Dynamics
Consider a special case of the model in Section II where the state of each member of each team evolves independently given the actions, i.e., $X^{i,j}_{t+1} = f^{i,j}_{t}(X^{i,j}_{t}, U_{t}, W^{i,j}_{t})$, where $(W^{i,j}_{t})_{t \in \mathcal{T}, (i,j) \in \mathcal{N}}$ are mutually independent primitive random variables. In this case, we show that the independence among team members' state dynamics enables us to consider equilibria where the coordinators assign prescriptions that map $X^{i,j}_{t}$ to $U^{i,j}_{t}$ (instead of mapping $X^{i,j}_{t-d+1:t}$ to $U^{i,j}_{t}$); this is because, given $H^{i}_{t}$, the belief of member $(i,j)$ about her teammates' states is independent of $X^{i,j}_{t-d+1:t}$. In other words, one can replace the hidden information $X^{i}_{t-d+1:t}$ with the sufficient hidden information $X^{i}_{t}$.

Definition 18 (Simple Prescriptions). A simple prescription for coordinator $i$ at time $t$ is a collection of functions $\theta^{i}_{t} = (\theta^{i,j}_{t})_{(i,j) \in \mathcal{N}^{i}}$, $\theta^{i,j}_{t} : \mathcal{X}^{i,j}_{t} \mapsto \mathcal{U}^{i,j}_{t}$.

Lemma 9.
Suppose that $g^{-i}$ is a behavioral coordination strategy profile for the coordinators other than coordinator $i$. Then there exists a best response behavioral coordination strategy $g^{i}$ for coordinator $i$ that chooses randomized simple prescriptions based on $H^{i}_{t}$.

Proof. See Appendix M.

Given the above result, one can restrict attention to sufficient hidden information based strategies, where each coordinator $i$ assigns simple prescriptions based on $H^{i}_{t}$. Consequently, results analogous to those of Sections IV and V can be derived by considering similar compressions of private and common information.

B. Refinement of Coordinators' Nash Equilibrium
In the game among coordinators, one can also consider the Coordinators' weak Perfect Bayesian Equilibrium (wPBE) [68] as a refinement of CNE. The Coordinators' wPBE refines the Coordinators' Nash Equilibrium by ruling out equilibrium outcomes that rely on non-credible threats [69].

Definition 19 (Coordinators' wPBE). Define $\mathcal{H}^{*}_{t} = \mathcal{X}_{t} \times \mathcal{Y}_{1:t-1} \times \mathcal{U}_{1:t-1} \times \boldsymbol{\Gamma}_{1:t-1}$. Let $g$ denote a behavioral coordination strategy profile of all coordinators, and let $\vartheta = (\vartheta^{i}_{t})_{i \in \mathcal{I}, t \in \mathcal{T}}$, $\vartheta^{i}_{t} : \mathcal{H}^{i}_{t} \to \Delta(\mathcal{H}^{*}_{t})$, denote a belief system. The strategy profile $g$ is said to be sequentially rational given $\vartheta$ if
\[
g^{i}_{t:T} \in \operatorname*{arg\,max}_{\tilde g^{i}_{t:T}} J^{i}_{t}(\tilde g^{i}_{t:T}, g^{-i}_{t:T}; \vartheta^{i}_{t}, h^{i}_{t}) \quad \forall h^{i}_{t} \in \mathcal{H}^{i}_{t},\ \forall i \in \mathcal{I},\ \forall t \in \mathcal{T},
\]
where
\[
J^{i}_{t}(\tilde g_{t:T}; \vartheta^{i}_{t}, h^{i}_{t}) := \sum_{\tilde h^{*}_{t}} \mathbb{E}^{\tilde g_{t:T}}\!\left[ \sum_{\tau = t}^{T} r^{i}_{\tau}(X_{\tau}, U_{\tau}) \,\middle|\, \tilde h^{*}_{t} \right] \vartheta^{i}_{t}(\tilde h^{*}_{t} \mid h^{i}_{t});
\]
the belief system $\vartheta$ is said to be consistent with $g$ [68] if
\[
\mathbb{P}^{g}(h^{i}_{t}) > 0 \implies \vartheta^{i}_{t}(\tilde h^{*}_{t} \mid h^{i}_{t}) = \frac{\mathbb{P}^{g}(\tilde h^{*}_{t}, h^{i}_{t})}{\mathbb{P}^{g}(h^{i}_{t})} \quad \forall \tilde h^{*}_{t} \in \mathcal{H}^{*}_{t},\ \forall h^{i}_{t} \in \mathcal{H}^{i}_{t},\ \forall t \in \mathcal{T},\ \forall i \in \mathcal{I}.
\]
A pair $(g, \vartheta)$ is called a Coordinators' wPBE if $g$ is sequentially rational given $\vartheta$ and $\vartheta$ is consistent with $g$.

(The compression of hidden information to sufficient hidden information is similar to the shredding of irrelevant information in [35]. We refer the interested reader to Chapter 9 of [68] for a detailed description of wPBE.)
Let $\rho$ be an SPIB strategy profile and $\vartheta$ a belief system. The pair $(\rho, \vartheta)$ is called an SPIB-wPBE if it forms a Coordinators' wPBE.

Proposition 2.
An SPIB-wPBE exists in the game among coordinators.

Proof.
The proof follows steps similar to those in the proof of Theorem 1.

As a result of the sequential decomposition of the dynamic game, under some assumptions on the belief generation systems, a CIB-CNE obtained from the sequential decomposition is a wPBE as well, where the beliefs $\vartheta$ can be derived from the CCI. This is formalized in the following proposition.

Definition 20.
Define
\[
\hat\Pi_{t}(h_{t}) := \Big\{ \pi_{t} \in \prod_{k \in \mathcal{I}} \Delta(\mathcal{S}^{k}_{t}) : \exists\, g_{1:t-1} \text{ s.t. } \mathbb{P}^{g_{1:t-1}}(h_{t}) > 0,\ \prod_{k \in \mathcal{I}} \pi^{k}_{t}(s^{k}_{t}) = \mathbb{P}^{g_{1:t-1}}(s_{t} \mid h_{t})\ \forall s_{t} \in \mathcal{S}_{t} \Big\}.
\]
A belief generation system $\psi^{*} = (\psi^{*}_{t})_{t \in \mathcal{T}}$, $\psi^{*}_{t} : \mathcal{B}_{t} \times \mathcal{Y}_{t} \times \mathcal{U}_{t} \to \prod_{k \in \mathcal{I}} \Delta(\mathcal{S}^{k}_{t+1})$, is said to be regular if for all $t \in \mathcal{T}$ we have $\psi^{*}_{t}(\pi_{t}, y_{t-d+1:t}, u_{t-d:t}) \in \hat\Pi_{t+1}(h_{t+1})$ whenever $\pi_{t} \in \hat\Pi_{t}(h_{t})$.

Intuitively, a belief generation system is regular if it assigns positive probability only to realizations of SPI that are admissible under some strategy profile $g$.

Proposition 3.
Let $(\lambda^{*}, \psi^{*})$ be a CIB strategy profile that satisfies the conditions of Theorem 2. Assume that $\psi^{*}$ is regular. Let $g^{*}$ be the behavioral coordination strategy profile induced from $(\lambda^{*}, \psi^{*})$. Then there exists a belief system $\vartheta^{*}$ such that $(g^{*}, \vartheta^{*})$ forms a Coordinators' wPBE.

Proof. See Appendix N.

VII. DISCUSSION
A. Implementation of Behavioral Coordination Strategies
One can interpret behavioral coordination strategies as strategies with coordinated randomization, i.e., the strategies are randomized, but all team members know exactly how this randomization is done. We note that the main purpose of randomization is to "confuse" other teams. As such, it is best to use coordinated randomization, where every team member knows what partial mapping their teammates are using; such coordinated randomization is superior to private and independent randomization by each individual member of a team. This is because individual randomization can create information that is unknown to teammates, while the same "confusion" effect on other teams can be achieved with coordinated randomization.

To implement behavioral coordination strategies, a team can utilize a correlation device that generates a random seed at each time $t$. Each member $(i,j)$ of team $i$ can then choose an action based on $H^{i,j}_{t}$ and the present and past random seeds generated by the correlation device, or equivalently, choose an action based on $(H^{i,j}_{t}, \Gamma^{i}_{1:t-1})$, where $\Gamma^{i}_{1:t-1}$ is sequentially updated. If the behavioral coordination strategy is a CIB strategy, then member $(i,j)$ needs to use $(B_{t}, X^{i}_{t-d}, \Phi^{i}_{t}, X^{i,j}_{t-d+1:t})$ and the current random seed to choose an action, where $(B_{t}, \Phi^{i}_{t})$ are sequentially updated.

In the absence of correlation devices accessible at every time, a behavioral coordination strategy can also be implemented as its equivalent mixed strategy (recall Lemma 1 and Lemma 2): before the beginning of the game, the team can jointly pick a strategy profile in $\mathcal{G}^{i}$ at random, according to a distribution induced from the behavioral coordination strategy.

B. Stage Game: IBNE vs BNE
One can observe that the beliefs of the agents defined in the stage game (Definition 13) can be seen as conditional distributions derived from the common prior
\[
\beta_{t}(\tilde z_{t}) = \prod_{k \in \mathcal{I}} \Big[ \pi^{k}_{t}(\tilde s^{k}_{t})\, P^{k}_{t}(\tilde x^{k}_{t-d+1:t} \mid y^{k}_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^{k}_{t})\, \mathbb{P}(\tilde w^{k,Y}_{t}) \Big]. \tag{8}
\]
However, in the aforementioned stage game we focus on the beliefs of the agents instead of a common prior, and we use Interim Bayesian Nash Equilibrium (IBNE) as the equilibrium concept instead of BNE. This is because, unlike in a standard Bayesian game with a common prior, the true prior of the stage game depends on the actual strategies played in previous stages. The prior $\beta_{t}$ described in (8) may not be the true prior, since some coordinator $i$ may have already deviated from the strategy prediction on which the $\pi^{i}_{t}$'s rely. However, coordinator $i$ always tries to optimize her reward given $(b_{t}, s^{i}_{t})$, whether $\pi^{i}_{t}(s^{i}_{t}) = 0$ or not. Hence in this stage game we must consider a player's belief and strategy for all possible realizations $s^{i}_{t}$ under any strategy profile, not just those with positive probability under the prior in (8). The corresponding equilibrium concept is Interim Bayesian Nash Equilibrium instead of Bayes-Nash Equilibrium. IBNE strengthens BNE by requiring the strategy of an agent to be optimal under all private information realizations, including those with zero probability under the common prior.

C. Choice of Compressed Common Information
In decentralized control [34] and certain settings of games among individuals [54], [55], a common information based belief $\Pi_{t}$ on the state is usually enough to serve as an information state, or compression of common information. However, in our setting we use a subset of actions and observations, in addition to the CIB belief, as the compressed common information. We argue below that this is necessary in our setting.

To illustrate the point, consider the case $d = 1$ and assume that all coordinators use the same belief generation system and hence the same CCI (denoted by $B^{*}_{t}$). An alternative to the CCI $B^{*}_{t} = ((\Pi^{*,i}_{t})_{i \in \mathcal{I}}, U_{t-1})$ is the CIB belief $\tilde\Pi^{*}_{t} = (\tilde\Pi^{*,i}_{t})_{i \in \mathcal{I}}$, $\tilde\Pi^{*,i}_{t} \in \Delta(\mathcal{X}^{i}_{t-1:t})$, where $\tilde\Pi^{*,i}_{t}$ represents the belief on $X^{i}_{t-1:t}$ based on the common information. One might argue that we can use $\tilde\Pi^{*}_{t}$ instead of $B^{*}_{t}$ through the following argument: after we transform the game into a game among coordinators, because of the full recall of coordinator $i$, coordinator $i$'s belief (on other coordinators' private information and all hidden information) is independent of her behavioral coordination strategy $\tilde g^{i}$. Hence coordinator $i$ can always form this belief as if she were using the strategy prediction $g^{*i}$, no matter what strategy she is actually using.

However, this argument can run into technical problems. A crucial step for Lemma 8 is Eq. (20), which establishes that coordinator $i$'s belief can be expressed as a function of $(B^{*}_{t}, X^{i}_{t-1})$ for any behavioral coordination strategy $\tilde g^{i}$ coordinator $i$ might use. To use $\tilde\Pi^{*}_{t}$ alone as the information state, one needs to argue that coordinator $i$'s belief on her hidden information, $\mathbb{P}(X^{i}_{t} = \cdot \mid x^{i}_{t-1}, u_{t-1})$, can be computed solely through $(\tilde\pi^{*,i}_{t}, x^{i}_{t-1})$ without using $u_{t-1}$.
Through the belief independence of strategy, one may argue that
\[
\mathbb{P}(x^{i}_{t} \mid x^{i}_{t-1}, u_{t-1}) = \mathbb{P}^{g^{*i}, g^{*-i}}(x^{i}_{t} \mid x^{i}_{t-1}, u_{t-1}) = \mathbb{P}^{g^{*i}, g^{*-i}}(x^{i}_{t} \mid x^{i}_{t-1}, y_{t-1}, u_{t-1})
\]
\[
= \frac{\mathbb{P}^{g^{*i}, g^{*-i}}(x^{i}_{t}, x^{i}_{t-1} \mid y_{t-1}, u_{t-1})}{\mathbb{P}^{g^{*i}, g^{*-i}}(x^{i}_{t-1} \mid y_{t-1}, u_{t-1})} = \frac{\tilde\pi^{*,i}_{t}(x^{i}_{t-1}, x^{i}_{t})}{\sum_{\tilde x^{i}_{t}} \tilde\pi^{*,i}_{t}(x^{i}_{t-1}, \tilde x^{i}_{t})}. \tag{9}
\]
However, the above argument is not always valid: it holds only when the denominator of (9) is non-zero, and the denominator can be zero. A simple example is the following. Let $\hat x^{i}_{t-1} \in \mathcal{X}^{i}_{t-1}$ be some fixed state and $\hat u^{i}_{t-1} \in \mathcal{U}^{i}_{t-1}$ be some fixed action profile. Let $\hat{\boldsymbol{\Gamma}}^{i}_{t-1}$ be the set of prescriptions that map $\hat x^{i}_{t-1}$ to $\hat u^{i}_{t-1}$. Suppose that the strategy prediction $g^{*i}$ is a behavioral coordination strategy satisfying
\[
g^{*i}_{t-1}(h^{i}_{t-1})(\gamma^{i}_{t-1}) = 0 \quad \forall h^{i}_{t-1} \in \mathcal{H}^{i}_{t-1},\ \gamma^{i}_{t-1} \in \hat{\boldsymbol{\Gamma}}^{i}_{t-1},
\]
i.e., under $g^{*i}$, coordinator $i$ never assigns a prescription that maps $\hat x^{i}_{t-1}$ to $\hat u^{i}_{t-1}$. If $\tilde\pi^{*,i}_{t}$ is consistent with the strategy prediction $g^{*i}$, then
\[
\sum_{\tilde x^{i}_{t}} \tilde\pi^{*,i}_{t}(\hat x^{i}_{t-1}, \tilde x^{i}_{t}) = \mathbb{P}^{g^{*i}, g^{-i}}(\hat x^{i}_{t-1} \mid h_{t}) = 0 \quad \text{whenever } u^{i}_{t-1} = \hat u^{i}_{t-1}.
\]
When coordinator $i$ uses a strategy $\tilde g^{i}$ under which $X^{i}_{t-1} = \hat x^{i}_{t-1}$, $U^{i}_{t-1} = \hat u^{i}_{t-1}$ can happen with non-zero probability, coordinator $i$ cannot use $\tilde\pi^{*,i}_{t}$ to form her belief on her hidden information. This is contrary to what we need in Eq. (20) in the proof of Lemma 8, which states that the belief function is compatible with any behavioral coordination strategy $\tilde g^{i}$.

D. Connection with Sufficient Information Approach
The compression of the coordinators' private information in our model can be seen as an application of the sufficient information approach of Tavafoghi et al. [36]. One can show that our sufficient private information $S^{i}_{t} = (X^{i}_{t-d}, \Phi^{i}_{t})$ satisfies the definition of sufficient private information (Definition 4) in [36] (hence we choose to use the same terminology): (i) it can be sequentially updated; (ii) it is sufficient for estimating future private information; (iii) it is sufficient for estimating cost; (iv) it is sufficient for estimating others' information at the current time [36]. We note that (i) is true due to Eq. (2); (ii) and (iii) are established and utilized in our Lemma 6; and (iv) is true because of the conditional independence between the coordinators (our Lemma 3). In [36], the authors proved that one can compress the common information into the sufficient common information (SCI) and consider sufficient information based strategies, which choose actions based on the SCI and SPI. The SCI is defined to be the common information based belief on the sufficient private information along with the system state, which is $X_{t-d+1:t}$ in our case. As we discussed in Section V-A, our CCI $B_{t}$ can be used to create the belief on $X_{t-d+1:t}$; hence our CCI $B_{t}$ (defined in Definition 10) plays the role of the SCI.

VIII. CONCLUSION AND FUTURE WORK

We studied a model of dynamic games among teams with asymmetric information, where the agents in each team share their observations with a delay of $d$. Each team is associated with a controlled Markov chain, whose dynamics are controlled by the actions of all agents. We developed a general approach to characterize a subset of Nash equilibria with the following feature: at each time, each agent can make her decision based on a compressed version of her information, instead of the full information.
We identified two subclasses of strategies: sufficient private information based (SPIB) strategies, which compress only private information, and compressed information based (CIB) strategies, which compress both common and private information. We showed that while SPIB-strategy-based equilibria always exist, CIB-strategy-based equilibria do not always exist. We developed a backward inductive sequential procedure whose solution (if it exists) is a CIB-strategy-based equilibrium, and we characterized certain game environments where the solution exists. Our results highlight the discord among compression of information, existence of (compression based) equilibria, and backward inductive sequential computation of such equilibria in stochastic dynamic games.

Moving forward, there are a few research problems arising from this work: (i) discovering broader conditions for the existence of CIB-CNE in the model of this paper; (ii) developing an efficient algorithm that solves the dynamic program for CIB-CNE (when they exist); (iii) determining the minimal additional information that needs to be added to the CCI so that a CIB-CNE (under the new CCI) is guaranteed to exist; (iv) defining a notion of ε-CIB-CNE, analyzing its existence, and developing sequential computation procedures to find them.

Other future research directions include identifying a suitable compression of information and developing a sequential decomposition for other models of games among teams, for example (i) games with continuous state and action spaces (e.g., linear quadratic Gaussian settings), and (ii) general models with non-observable actions.

REFERENCES

[1] E. Maskin and J. Tirole, "A theory of dynamic oligopoly, I: Overview and quantity competition with large fixed costs,"
Econometrica: Journal of the Econometric Society, pp. 549–569, 1988.
[2] ——, "A theory of dynamic oligopoly, II: Price competition, kinked demand curves, and Edgeworth cycles," Econometrica: Journal of the Econometric Society, pp. 571–599, 1988.
[3] T. Doganoglu, "Dynamic price competition with consumption externalities," Netnomics, vol. 5, no. 1, pp. 43–69, 2003.
[4] D. Bergemann and J. Välimäki, "Dynamic price competition," Journal of Economic Theory, vol. 127, no. 1, pp. 232–263, 2006.
[5] L. Cabral, "Dynamic price competition with network effects," The Review of Economic Studies, vol. 78, no. 1, pp. 83–111, 2011.
[6] H. Tavafoghi, Y. Ouyang, D. Teneketzis, and M. Wellman, "Game theoretic approaches to cyber security: Challenges, results, and open problems," in Adversarial and Uncertain Reasoning for Adaptive Cyber Defense: Control- and Game-theoretic Approaches to Cyber Security, S. Jajodia, G. Cybenko, P. Liu, C. Wang, and M. Wellman, Eds. Springer Nature, 2019, vol. 11830, pp. 29–53.
[7] S. Amin, X. Litrico, S. Sastry, and A. M. Bayen, "Cyber security of water SCADA systems – part I: Analysis and experimentation of stealthy deception attacks," IEEE Transactions on Control Systems Technology, vol. 21, no. 5, pp. 1963–1970, 2012.
[8] S. Amin, G. A. Schwartz, A. A. Cárdenas, and S. S. Sastry, "Game-theoretic models of electricity theft detection in smart utility networks: Providing new capabilities with advanced metering infrastructure," IEEE Control Systems Magazine, vol. 35, no. 1, pp. 66–81, 2015.
[9] Q. Zhu and T. Başar, "Game-theoretic methods for robustness, security, and resilience of cyberphysical control systems: games-in-games principle for optimal cross-layer resilient control systems," IEEE Control Systems Magazine, vol. 35, no. 1, pp. 46–65, 2015.
[10] D. Shelar and S. Amin, "Security assessment of electricity distribution networks under DER node compromises," IEEE Transactions on Control of Network Systems, vol. 4, no. 1, pp. 23–36, 2016.
[11] M. Colombino, R. S. Smith, and T. H. Summers, "Mutually quadratically invariant information structures in two-team stochastic dynamic games," IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 2256–2263, 2017.
[12] T. Summers, C. Li, and M. Kamgarpour, "Information structure design in team decision problems," IFAC-PapersOnLine, vol. 50, no. 1, pp. 2530–2535, 2017.
[13] P. A. Hancock, I. Nourbakhsh, and J. Stewart, "On the future of transportation in an era of automated and autonomous vehicles," Proceedings of the National Academy of Sciences, vol. 116, no. 16, pp. 7684–7691, 2019.
[14] T. Harbert. (2014) Radio wrestlers fight it out at the DARPA Spectrum Challenge. [Online]. Available: https://spectrum.ieee.org/telecom/wireless/radio-wrestlers-fight-it-out-at-the-darpa-spectrum-challenge
[15] R. B. Myerson, Game Theory. Harvard University Press, 2013.
[16] H. Witsenhausen, "On the structure of real-time source coders," Bell System Technical Journal, vol. 58, no. 6, pp. 1437–1451, 1979.
[17] J. Walrand and P. Varaiya, "Optimal causal coding-decoding problems," IEEE Transactions on Information Theory, vol. 29, no. 6, pp. 814–820, 1983.
[18] D. Teneketzis, "On the structure of optimal real-time encoders and decoders in noisy communication," IEEE Transactions on Information Theory, vol. 52, no. 9, pp. 4017–4035, 2006.
[19] A. Nayyar and D. Teneketzis, "On the structure of real-time encoding and decoding functions in a multiterminal communication system," IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 6196–6214, 2011.
[20] Y. Kaspi and N. Merhav, "Structure theorem for real-time variable-rate lossy source encoders and memory-limited decoders with side information," in ISIT, 2010, pp. 86–90.
[21] R. R. Tenney and N. R. Sandell, "Detection with distributed sensors," IEEE Transactions on Aerospace and Electronic Systems, no. 4, pp. 501–510, 1981.
[22] J. N. Tsitsiklis, "Decentralized detection," Advances in Statistical Signal Processing, pp. 297–344, 1993.
[23] D. Teneketzis and Y.-C. Ho, "The decentralized Wald problem," Information and Computation, vol. 73, no. 1, pp. 23–44, 1987.
[24] V. V. Veeravalli, T. Başar, and H. V. Poor, "Decentralized sequential detection with a fusion center performing the sequential test," IEEE Transactions on Information Theory, vol. 39, no. 2, pp. 433–442, 1993.
[25] ——, "Decentralized sequential detection with sensors performing sequential tests," Mathematics of Control, Signals and Systems, vol. 7, no. 4, pp. 292–305, 1994.
[26] A. Nayyar and D. Teneketzis, "Sequential problems in decentralized detection with communication," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5410–5435, 2011.
[27] D. Teneketzis and P. Varaiya, "The decentralized quickest detection problem," IEEE Transactions on Automatic Control, vol. 29, no. 7, pp. 641–644, 1984.
[28] V. V. Veeravalli, "Decentralized quickest change detection," IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1657–1665, 2001.
[29] P. Varaiya and J. Walrand, "Causal coding and control for Markov chains," Systems & Control Letters, vol. 3, no. 4, pp. 189–192, 1983.
[30] A. Mahajan and D. Teneketzis, "Optimal performance of networked control systems with nonclassical information structures," SIAM Journal on Control and Optimization, vol. 48, no. 3, pp. 1377–1404, 2009.
[31] H. S. Witsenhausen, "A standard form for sequential stochastic control," Mathematical Systems Theory, vol. 7, no. 1, pp. 5–11, 1973.
[32] A. Mahajan, "Sequential decomposition of sequential dynamic teams: Applications to real-time communication and networked control systems," Ph.D. dissertation, University of Michigan, Ann Arbor, 2008.
[33] A. Nayyar, A. Mahajan, and D. Teneketzis, "Optimal control strategies in delayed sharing information structures," IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606–1620, 2010.
[34] ——, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013.
[35] A. Mahajan, "Optimal decentralized control of coupled subsystems with control sharing," IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2377–2382, 2013.
[36] H. Tavafoghi, Y. Ouyang, and D. Teneketzis, "A unified approach to dynamic decision problems with asymmetric information – part I: Non-strategic agents," arXiv preprint, 2018.
[37] G. J. Mailath and L. Samuelson, Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, 2006.
[38] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory. SIAM, 1999, vol. 23.
[39] J. Filar and K. Vrieze, Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
[40] J. Renault, "The value of Markov chain games with lack of information on one side," Mathematics of Operations Research, vol. 31, no. 3, pp. 490–512, 2006.
[41] ——, "The value of repeated games with an informed controller," Mathematics of Operations Research, vol. 37, no. 1, pp. 154–179, 2012.
[42] J. Zheng and D. A. Castañón, "Decomposition techniques for Markov zero-sum games with nested information." IEEE, 2013, pp. 574–581.
[43] F. Gensbittel and J. Renault, "The value of Markov chain games with incomplete information on both sides," Mathematics of Operations Research, vol. 40, no. 4, pp. 820–841, 2015.
[44] L. Li and J. Shamma, "LP formulation of asymmetric zero-sum stochastic games." IEEE, 2014, pp. 1930–1935.
[45] L. Li, C. Langbort, and J. Shamma, "Solving two-player zero-sum repeated Bayesian games," arXiv preprint arXiv:1703.01957, 2017.
[46] P. Cardaliaguet, C. Rainer, D. Rosenberg, and N. Vieille, "Markov games with frequent actions and incomplete information—the limit case," Mathematics of Operations Research, vol. 41, no. 1, pp. 49–71, 2016.
[47] D. Kartik and A. Nayyar, "Upper and lower values in zero-sum stochastic games with asymmetric information," Dynamic Games and Applications, pp. 1–26, 2020.
[48] E. Maskin and J. Tirole, "Markov perfect equilibrium: I. Observable actions," Journal of Economic Theory, vol. 100, no. 2, pp. 191–219, 2001.
[49] ——, "Markov equilibrium," in J. F. Mertens Memorial Conference, 2013. [Online]. Available: https://youtu.be/UNtLnKJzrhs
[50] A. Nayyar and T. Başar, "Dynamic stochastic games with asymmetric information." IEEE, 2012, pp. 7145–7150.
[51] A. Gupta, A. Nayyar, C. Langbort, and T. Başar, "Common information based Markov perfect equilibria for linear-Gaussian games with asymmetric information," SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 3228–3260, 2014.
[52] A. Gupta, C. Langbort, and T. Başar, "Dynamic games with asymmetric information and resource constrained players with applications to security of cyberphysical systems," IEEE Transactions on Control of Network Systems, vol. 4, no. 1, pp. 71–81, 2016.
[53] Y. Ouyang, H. Tavafoghi, and D. Teneketzis, "Dynamic oligopoly games with private Markovian dynamics." IEEE, 2015, pp. 5851–5858.
[54] A. Nayyar, A. Gupta, C. Langbort, and T. Başar, "Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games," IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 555–570, 2013.
[55] Y. Ouyang, H. Tavafoghi, and D. Teneketzis, "Dynamic games with asymmetric information: Common information based perfect Bayesian equilibria and sequential decomposition," IEEE Transactions on Automatic Control, vol. 62, no. 1, pp. 222–237, 2016.
[56] H. Tavafoghi, Y. Ouyang, and D. Teneketzis, "On stochastic dynamic games with delayed sharing information structure." IEEE, 2016, pp. 7002–7009.
[57] H. Tavafoghi, "On design and analysis of cyber-physical systems with strategic agents," Ph.D. dissertation, University of Michigan, Ann Arbor, 2017.
[58] D. Vasal, A. Sinha, and A. Anastasopoulos, "A systematic process for evaluating structured perfect Bayesian equilibria in dynamic games with asymmetric information," IEEE Transactions on Automatic Control, vol. 64, no. 1, pp. 81–96, 2019.
[59] G. Farina, A. Celli, N. Gatti, and T. Sandholm, "Ex ante coordination and collusion in zero-sum multi-player extensive-form games," in Conference on Neural Information Processing Systems (NIPS), 2018.
[60] Y. Zhang and B. An, "Computing team-maxmin equilibria in zero-sum multiplayer extensive-form games," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 02, 2020, pp. 2318–2325.
[61] V. Anantharam and V. Borkar, "Common randomness and distributed control: A counterexample," Systems & Control Letters, vol. 56, no. 7-8, pp. 568–572, 2007.
[62] S. Bhattacharya and T. Başar, "Multi-layer hierarchical approach to double sided jamming games among teams of mobile agents." IEEE, 2012, pp. 5774–5779.
[63] C. A. Cox and B. Stoddard, "Strategic thinking in public goods games with teams," Journal of Public Economics, vol. 161, pp. 31–43, 2018.
[64] D. J. Cooper and J. H. Kagel, "Are two heads better than one? Team versus individual play in signaling games," American Economic Review, vol. 95, no. 3, pp. 477–509, 2005.
[65] H. Kuhn, "Extensive games and the problem of information, in: H. W. Kuhn and A. W. Tucker (eds.),"
Contributions to the Theory of Games ,vol. 2, pp. 193–216, 1953.[66] M. J. Osborne and A. Rubinstein,
A course in game theory . The MITPress, 1994.[67] T. Yoshikawa, “Decomposition of dynamic team decision problems,”
IEEE Transactions on Automatic Control , vol. 23, no. 4, pp. 627–632,Aug 1978.[68] A. Mas-Colell, M. D. Whinston, J. R. Green et al. , Microeconomictheory . Oxford university press New York, 1995, vol. 1.[69] D. Fudenberg and J. Tirole,
Game theory . MIT press, 1991. A PPENDIX
A. Proof of Claim in Example 1
Define two pure strategies $\mu^A$ and $\tilde\mu^A$ of Team A as follows:
$$\mu^{A,1}(x^{A,1}) = x^{A,1}, \qquad \mu^{A,2}(x^{A,2}) = -x^{A,2},$$
$$\tilde\mu^{A,1}(x^{A,1}) = -x^{A,1}, \qquad \tilde\mu^{A,2}(x^{A,2}) = x^{A,2}.$$
Now, assume that Team A and Team B are restricted to use independently randomized strategies (type 2 strategies defined in Section II-B). We will show in two steps that there exist no equilibria within this class of strategies.
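Before the two steps, the claim can be sanity-checked numerically. The sketch below is an illustration, not part of the proof: it enumerates every deterministic "flip rule" of Team B and evaluates the two reward expressions derived in Step 1 for $\mu^A$ and $\tilde\mu^A$, confirming that they always sum to zero, so one of the two pure strategies always secures at least $0$. The uniform $\pm 1$ states and the reward expressions are taken from Step 1 of the proof; the helper names are ours.

```python
import itertools

# States X^{A,1}, X^{A,2} are uniform on {-1,+1}.  A deterministic Team B
# response is a pair of "flip tables" pi1, pi2: each maps the observed
# signal (u1, u2) to 1 (guess the opposite of u_j) or 0 (trust u_j).
signals = list(itertools.product([-1, 1], repeat=2))

def J_muA(pi1, pi2):
    # Team A plays mu^A: signal = (x1, -x2); reward = 1 - E[(1 - pi1) + pi2],
    # as in Step 1 of the proof.
    total = 0.0
    for x1, x2 in itertools.product([-1, 1], repeat=2):
        s = (x1, -x2)
        total += 1 - ((1 - pi1[s]) + pi2[s])
    return total / 4

def J_muA_tilde(pi1, pi2):
    # Team A plays tilde-mu^A: signal = (-x1, x2); reward = 1 - E[pi1 + (1 - pi2)].
    total = 0.0
    for x1, x2 in itertools.product([-1, 1], repeat=2):
        s = (-x1, x2)
        total += 1 - (pi1[s] + (1 - pi2[s]))
    return total / 4

# Enumerate all 16 x 16 deterministic flip-table pairs for Team B.
worst = []
for bits1 in itertools.product([0, 1], repeat=4):
    for bits2 in itertools.product([0, 1], repeat=4):
        pi1 = dict(zip(signals, bits1))
        pi2 = dict(zip(signals, bits2))
        a, b = J_muA(pi1, pi2), J_muA_tilde(pi1, pi2)
        assert abs(a + b) < 1e-12   # the two pure strategies are zero-sum paired
        worst.append(max(a, b))

print(min(worst))   # 0.0: one of mu^A, tilde-mu^A always secures at least 0
```

The same enumeration also shows the exploitability used in Step 2: against Team B's best response to $\mu^A$, deviating to $\tilde\mu^A$ yields $+1$.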
Step 1:
If Team A's and Team B's type 2 strategies form an equilibrium, then Team A is playing either $\mu^A$ or $\tilde\mu^A$.

Let $p_j(x)$ denote the probability that player $(A,j)$ plays $U^{A,j} = -x$ given $X^{A,j} = x$. Define $q_j = \frac{1}{2}p_j(-1) + \frac{1}{2}p_j(+1)$, i.e., the ex-ante probability that player $(A,j)$ "lies." Then we have
$$\mathbb{E}[r^A_1(X_1, U_1)] = q_1(1-q_2) + q_2(1-q_1).$$
Under an equilibrium, Team B will optimally respond to Team A's strategy described through $(p_1, p_2)$. We can find a lower bound on Team B's reward by fixing a strategy: consider the "random guess" strategy of Team B, where each of $(B,j)$ (for $j = 1, 2$) chooses $U^{B,j}$ uniformly at random, irrespective of $U^A$ and independently of the other team member. Team B can thus guarantee an expected reward of $\frac{1}{2} + \frac{1}{2} = 1$ given any strategy of Team A. Since $r^A_2(X_2, U_2) = -r^B_2(X_2, U_2)$, we conclude that Team A's total reward in an equilibrium is upper bounded by
$$q_1(1-q_2) + q_2(1-q_1) - 1 = -q_1 q_2 - (1-q_1)(1-q_2) \le 0.$$
Let $\sigma^B$ denote the strategy of Team B. Let $\pi_j(u_1, u_2)$ denote the probability that player $(B,j)$ plays $U^{B,j} = -u_j$ given $U^{A,1} = u_1, U^{A,2} = u_2$ (i.e., the probability that player $(B,j)$ believes that $(A,j)$ was "lying" and hence guesses the opposite of what was signaled). If Team A plays $\mu^A$, then the total reward of Team A is
$$J^A(\mu^A, \sigma^B) = 1 - \mathbb{E}[1 - \pi_1(X^{A,1}, -X^{A,2}) + \pi_2(X^{A,1}, -X^{A,2})] = \frac{1}{4}\sum_{x\in\{-1,+1\}^2} (\pi_1(x) - \pi_2(x)).$$
If Team A plays $\tilde\mu^A$, then the total reward of Team A is
$$J^A(\tilde\mu^A, \sigma^B) = 1 - \mathbb{E}[\pi_1(-X^{A,1}, X^{A,2}) + 1 - \pi_2(-X^{A,1}, X^{A,2})] = \frac{1}{4}\sum_{x\in\{-1,+1\}^2} (\pi_2(x) - \pi_1(x)).$$
Observe that $J^A(\mu^A, \sigma^B) + J^A(\tilde\mu^A, \sigma^B) = 0$. Hence for any $\sigma^B$, either $J^A(\mu^A, \sigma^B) \ge 0$ or $J^A(\tilde\mu^A, \sigma^B) \ge 0$. In particular, we can conclude that Team A's total reward is at least $0$ in any equilibrium.

We have established both an upper bound and a lower bound for Team A's total reward in an equilibrium. Hence we must have
$$-q_1 q_2 - (1-q_1)(1-q_2) = 0,$$
which implies $q_1 = 0, q_2 = 1$ or $q_1 = 1, q_2 = 0$. The former case corresponds to Team A playing the pure strategy $\mu^A$, and the latter to playing $\tilde\mu^A$.

Step 2: There do not exist equilibria where Team A plays $\mu^A$ or $\tilde\mu^A$.

Suppose that Team A plays $\mu^A$. Then the only best response of Team B is to play $U^{B,1} = U^{A,1}$, $U^{B,2} = -U^{A,2}$. Then, Team A's total reward is $J^A(\mu^A, \sigma^B) = 1 - 1 - 1 = -1$. If Team A deviates to $\tilde\mu^A$, then Team A can obtain a total reward of $+1$ (recall that $J^A(\mu^A, \sigma^B) + J^A(\tilde\mu^A, \sigma^B) = 0$ for any $\sigma^B$). Hence Team A does not play $\mu^A$ at equilibrium. A similar argument applies to $\tilde\mu^A$, which completes the proof.

B. Proof of Lemma 1
Given a pure strategy profile $\mu$, define a pure coordination strategy profile $\nu$ by
$$\nu^i_t(h^i_t, \gamma^i_{1:t-1}) = \big(\mu^{i,j}_t(h^i_t, \cdot)\big)_{(i,j)\in\mathcal{N}^i} \qquad \forall h^i_t \in \mathcal{H}^i_t,\ \gamma^i_{1:t-1} \in \Gamma^i_{1:t-1},\ \forall i \in \mathcal{I}.$$
We first prove one side of the result by coupling two systems, i.e., for every pure strategy profile $\mu$, there exists an equivalent coordination strategy profile $\nu$. In one of the systems, we assume that the pure strategies are used. In the other system, we assume that the corresponding pure coordination strategies are used. The realizations of the primitive random variables (i.e., $(X^i_1)_{i\in\mathcal{I}}, (W^{i,X}_t, W^{i,Y}_t)_{i\in\mathcal{I}, t\in\mathcal{T}}$) are assumed to be the same for the two systems. We proceed to show that the realizations of all system variables (i.e., $(X_t, Y_t, U_t)_{t\in\mathcal{T}}$) will be the same for both systems. As a result, the expected payoffs are the same for both systems. The other direction can be proved analogously.

We prove that the realizations of $(X_t, Y_t, U_t)_{t\in\mathcal{T}}$ are the same by induction on time $t$.

Induction Base: At $t = 1$, the realizations of $X_1$ are the same for the two systems by assumption. For the first system we have $U^{i,j}_1 = \mu^{i,j}_1(X^{i,j}_1)$, and for the second system we have $\Gamma^i_1 = \nu^i_1(H^i_1) = (\mu^{i,j}_1(\cdot))_{(i,j)\in\mathcal{N}^i}$ and $U^{i,j}_1 = \Gamma^{i,j}_1(X^{i,j}_1)$, which means that $U^{i,j}_1 = \mu^{i,j}_1(X^{i,j}_1)$ also holds in the second system. Since $(W^{i,Y}_1)_{i\in\mathcal{I}}$ are the same for both systems, $Y^i_1 = \ell^i_1(X^i_1, U_1, W^{i,Y}_1)$ are the same for both systems.

Induction Step: Suppose that $X_s, Y_s, U_s$ are the same for both systems for all $s < t$. Now we prove the same for $t$. First, since the realizations of $X^i_{t-1}, U_{t-1}, W^{i,X}_{t-1}$ are the same, $X^i_t = f^i_t(X^i_{t-1}, U_{t-1}, W^{i,X}_{t-1})$ is the same for both systems. Consider the actions. For the first system, $U^{i,j}_t = \mu^{i,j}_t(H^{i,j}_t) = \mu^{i,j}_t(H^i_t, X^{i,j}_{t-d+1:t})$. In the second system, $\Gamma^i_t = \nu^i_t(H^i_t) = (\mu^{i,j}_t(H^i_t, \cdot))_{(i,j)\in\mathcal{N}^i}$ and $U^{i,j}_t = \Gamma^{i,j}_t(X^{i,j}_{t-d+1:t})$, which means that $U^{i,j}_t = \mu^{i,j}_t(H^i_t, X^{i,j}_{t-d+1:t})$.

We conclude that $U_t$ has the same realization in the two systems, since $(H^i_t, X^{i,j}_{t-d+1:t})$ have the same realization by the induction hypothesis and the argument above. Since $(W^{i,Y}_t)_{i\in\mathcal{I}}$ are the same for both systems, $Y^i_t = \ell^i_t(X^i_t, U_t, W^{i,Y}_t)$ are the same for both systems. This establishes the induction step, proving that for every pure strategy profile $\mu$ there exists an equivalent coordination strategy profile $\nu$.

To complete the other half of the proof, for each given coordination strategy $\nu$ we define
$$\mu^{i,j}_t(h^{i,j}_t) = \gamma^{i,j}_t(x^{i,j}_{t-d+1:t}) \qquad \forall h^{i,j}_t \in \mathcal{H}^{i,j}_t,$$
where $\gamma^i_t = (\gamma^{i,j}_t)_{(i,j)\in\mathcal{N}^i}$ is recursively defined by $\nu^i_{1:t}$ and $h^i_t$ through $\gamma^i_t = \nu^i_t(h^i_t, \gamma^i_{1:t-1})$ for all $t \in \mathcal{T}$. Then, using a similar argument, we can show that $\mu$ is equivalent to $\nu$.

C. Proof of Lemma 3
We proceed by induction on time $t$.

Induction Base: At $t = 1$, $X^k_1$ are independent across $k$ because of the assumption on the primitive random variables. Furthermore, since $H_1$ is a deterministic random vector (see Remark 2) and the randomizations of different coordinators are independent, we conclude that $(X^k_1, \Gamma^k_1)$ are mutually independent across $k$. The distribution of $(X^k_1, \Gamma^k_1)$ depends on $g$ only through $g^k$.

Induction Step: Suppose that $(X^k_{1:t}, \Gamma^k_{1:t})$ are conditionally independent given $H_t$ and that $\mathbb{P}^g(x^k_{1:t}, \gamma^k_{1:t} \mid h_t)$ depends on $g$ only through $g^k$. Now, we have
$$\mathbb{P}^g(x_{1:t+1}, \gamma_{1:t+1} \mid h_{t+1}) = \mathbb{P}^g(x_{t+1} \mid h_{t+1}, x_{1:t}, \gamma_{1:t+1})\, \mathbb{P}^g(\gamma_{t+1} \mid h_{t+1}, x_{1:t}, \gamma_{1:t})\, \mathbb{P}^g(x_{1:t}, \gamma_{1:t} \mid h_{t+1})$$
$$= \left( \prod_{k\in\mathcal{I}} \mathbb{P}(x^k_{t+1} \mid x^k_t, u_t)\, g^k_{t+1}(\gamma^k_{t+1} \mid h_{t+1}, x^k_{1:t-d+1}, \gamma^k_{1:t}) \right) \mathbb{P}^g(x_{1:t}, \gamma_{1:t} \mid h_{t+1}).$$
We then claim that
$$\mathbb{P}^g(x_{1:t}, \gamma_{1:t}, y_t, u_t \mid h_t) = \prod_{k\in\mathcal{I}} F^k_t(x^k_{1:t}, \gamma^k_{1:t}, h_{t+1}),$$
where for each $k \in \mathcal{I}$, $F^k_t$ is a function that depends only on $g^k$. To establish the claim we note that
$$\mathbb{P}^g(x_{1:t}, \gamma_{1:t}, y_t, u_t \mid h_t) = \mathbb{P}^g(y_t, u_t \mid h_t, x_{1:t}, \gamma_{1:t})\, \mathbb{P}^g(x_{1:t}, \gamma_{1:t} \mid h_t)$$
$$= \left( \prod_{k\in\mathcal{I}} \mathbb{P}(y^k_t \mid x^k_t, u_t)\, \mathbb{1}\{u^k_t = \gamma^k_t(x^k_{t-d+1:t})\} \right) \left( \prod_{k\in\mathcal{I}} \mathbb{P}^{g^k}(x^k_{1:t}, \gamma^k_{1:t} \mid h_t) \right) = \prod_{k\in\mathcal{I}} F^k_t(x^k_{1:t}, \gamma^k_{1:t}, h_{t+1}),$$
where in the second step we have used the induction hypothesis. Given the claim, we have
$$\mathbb{P}^g(x_{1:t}, \gamma_{1:t} \mid h_{t+1}) = \frac{\mathbb{P}^g(x_{1:t}, \gamma_{1:t}, y_t, u_t \mid h_t)}{\sum_{\tilde{x}_{1:t}, \tilde{\gamma}_{1:t}} \mathbb{P}^g(\tilde{x}_{1:t}, \tilde{\gamma}_{1:t}, y_t, u_t \mid h_t)} = \prod_{k\in\mathcal{I}} \frac{F^k_t(x^k_{1:t}, \gamma^k_{1:t}, h_{t+1})}{\sum_{\tilde{x}^k_{1:t}, \tilde{\gamma}^k_{1:t}} F^k_t(\tilde{x}^k_{1:t}, \tilde{\gamma}^k_{1:t}, h_{t+1})},$$
and then
$$\mathbb{P}^g(x_{1:t+1}, \gamma_{1:t+1} \mid h_{t+1}) = \prod_{k\in\mathcal{I}} G^k_t(x^k_{1:t+1}, \gamma^k_{1:t+1}, h_{t+1}),$$
where $G^k_t$ is given by
$$G^k_t(x^k_{1:t+1}, \gamma^k_{1:t+1}, h_{t+1}) = \mathbb{P}(x^k_{t+1} \mid x^k_t, u_t)\, g^k_{t+1}(\gamma^k_{t+1} \mid h_{t+1}, x^k_{1:t-d+1}, \gamma^k_{1:t})\, \frac{F^k_t(x^k_{1:t}, \gamma^k_{1:t}, h_{t+1})}{\sum_{\tilde{x}^k_{1:t}, \tilde{\gamma}^k_{1:t}} F^k_t(\tilde{x}^k_{1:t}, \tilde{\gamma}^k_{1:t}, h_{t+1})}.$$
One can check that $G^k_t$ depends on $g$ only through $g^k$ and that $\sum_{\tilde{x}^k_{1:t+1}, \tilde{\gamma}^k_{1:t+1}} G^k_t(\tilde{x}^k_{1:t+1}, \tilde{\gamma}^k_{1:t+1}, h_{t+1}) = 1$; therefore
$$G^k_t(x^k_{1:t+1}, \gamma^k_{1:t+1}, h_{t+1}) = \mathbb{P}^{g^k}(x^k_{1:t+1}, \gamma^k_{1:t+1} \mid h_{t+1}).$$
This establishes the induction step.
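The mechanism behind Lemma 3 is that both the transition kernel and the likelihood of $(y_t, u_t)$ factorize across teams, so a product-form belief stays product-form under the Bayes update. The following is a minimal numerical sketch of that mechanism, not the lemma itself: two independent two-state chains (hypothetical numbers), with actions suppressed, since conditioned on the realized actions the update factorizes the same way.

```python
# Two teams, each a 2-state Markov chain (hypothetical numbers; the point is
# the factorized Bayes update in the proof of Lemma 3, not these values).
P = [[[0.9, 0.1], [0.3, 0.7]],    # transition matrix of team 1
     [[0.6, 0.4], [0.2, 0.8]]]    # transition matrix of team 2
L = [[[0.8, 0.2], [0.25, 0.75]],  # observation likelihoods: L[k][x][y] = P(y^k | x^k)
     [[0.7, 0.3], [0.4, 0.6]]]

def joint_update(joint, y):
    """Bayes + prediction step on the belief over the JOINT state (2x2)."""
    post = [[joint[a][b] * L[0][a][y[0]] * L[1][b][y[1]] for b in range(2)]
            for a in range(2)]
    z = sum(sum(row) for row in post)
    post = [[p / z for p in row] for row in post]
    return [[sum(post[a][b] * P[0][a][c] * P[1][b][d]
                 for a in range(2) for b in range(2))
             for d in range(2)] for c in range(2)]

def team_update(belief, k, yk):
    """The same computation carried out team by team."""
    post = [belief[x] * L[k][x][yk] for x in range(2)]
    z = sum(post)
    post = [p / z for p in post]
    return [sum(post[x] * P[k][x][c] for x in range(2)) for c in range(2)]

joint = [[0.25, 0.25], [0.25, 0.25]]          # product-form prior
margs = [[0.5, 0.5], [0.5, 0.5]]

for y in [(0, 1), (1, 1), (0, 0)]:            # an arbitrary observation sequence
    joint = joint_update(joint, y)
    margs = [team_update(margs[k], k, y[k]) for k in range(2)]
    # Lemma 3 in miniature: the joint belief stays a product of per-team beliefs.
    for a in range(2):
        for b in range(2):
            assert abs(joint[a][b] - margs[0][a] * margs[1][b]) < 1e-9
```

Each team's posterior depends only on that team's own likelihood and transition, which is the counterpart of "$G^k_t$ depends on $g$ only through $g^k$" above.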
D. Proof of Lemma 5
Assume that $h^i_t \in \mathcal{H}^i_t$ is admissible under $g$. From Lemma 3, we know that $\mathbb{P}^g(x^i_{1:t}, \gamma^i_{1:t} \mid h_t)$ does not depend on $g^{-i}$. As a conditional distribution obtained from $\mathbb{P}^g(x^i_{1:t}, \gamma^i_{1:t} \mid h_t)$, $\mathbb{P}^g(x^i_{t-d+1:t} \mid h^i_t)$ does not depend on $g^{-i}$ either. Therefore, we can compute the belief of coordinator $i$ by replacing $g^{-i}$ with $\hat{g}^{-i}$, an open-loop strategy profile that always generates the actions $u^{-i}_{1:t-1}$:
$$\mathbb{P}^{g^i, g^{-i}}(x^i_{t-d+1:t} \mid h^i_t) = \mathbb{P}^{g^i, \hat{g}^{-i}}(x^i_{t-d+1:t} \mid h^i_t).$$
Note that we always have $\mathbb{P}^{g^i, \hat{g}^{-i}}(h^i_t) > 0$ for all $h^i_t$ admissible under $g$. Furthermore, by Lemma 3 we can also introduce additional, conditionally independent random variables into the conditioning, i.e.,
$$\mathbb{P}^{g^i, \hat{g}^{-i}}(x^i_{t-d+1:t} \mid h^i_t) = \mathbb{P}^{g^i, \hat{g}^{-i}}(x^i_{t-d+1:t} \mid h^i_t, x^{-i}_{t-d:t}),$$
where $x^{-i}_{t-d:t} \in \mathcal{X}^{-i}_{t-d:t}$ is such that $\mathbb{P}^{g^i, \hat{g}^{-i}}(x^{-i}_{t-d:t} \mid h^i_t) > 0$. Let $\tau = t - d + 1$. By Bayes' rule,
$$\mathbb{P}^{g^i, \hat{g}^{-i}}(x^i_{\tau:t} \mid h^i_t, x^{-i}_{\tau-1:t}) = \frac{\mathbb{P}^{g^i, \hat{g}^{-i}}(x_{\tau:t}, y_{\tau:t-1}, u_{\tau:t-1}, \gamma^i_{\tau:t-1} \mid h^{*i}_\tau)}{\sum_{\tilde{x}^i_{\tau:t}} \mathbb{P}^{g^i, \hat{g}^{-i}}(\tilde{x}^i_{\tau:t}, x^{-i}_{\tau:t}, y_{\tau:t-1}, u_{\tau:t-1}, \gamma^i_{\tau:t-1} \mid h^{*i}_\tau)}, \tag{10}$$
where $h^{*i}_\tau = (y_{1:\tau-1}, u_{1:\tau-1}, x^i_{1:\tau-1}, x^{-i}_{\tau-1}, \gamma^i_{1:\tau-1})$. We have
$$\mathbb{P}^{g^i, \hat{g}^{-i}}(x_{\tau:t}, y_{\tau:t-1}, u_{\tau:t-1}, \gamma^i_{\tau:t-1} \mid h^{*i}_\tau) = \prod_{l=1}^{d-1} \Big[ \mathbb{P}^{g^i, \hat{g}^{-i}}(x_{t-l+1}, y_{t-l} \mid h^{*i}_\tau, x_{\tau:t-l}, y_{\tau:t-l-1}, u_{\tau:t-l}, \gamma^i_{\tau:t-l})$$
$$\times\, \mathbb{P}^{g^i, \hat{g}^{-i}}(u^i_{t-l} \mid h^{*i}_\tau, x_{\tau:t-l}, y_{\tau:t-l-1}, u_{\tau:t-l-1}, \gamma^i_{\tau:t-l}) \times \mathbb{P}^{g^i, \hat{g}^{-i}}(\gamma^i_{t-l} \mid h^{*i}_\tau, x_{\tau:t-l}, y_{\tau:t-l-1}, u_{\tau:t-l-1}, \gamma^i_{\tau:t-l-1}) \Big] \times \mathbb{P}^{g^i, \hat{g}^{-i}}(x_\tau \mid h^{*i}_\tau). \tag{11}$$
The first three terms in the above product are
$$\mathbb{P}^{g^i,\hat g^{-i}}(x_{t-l+1}, y_{t-l} \mid \cdot) = \prod_{k\in\mathcal{I}} \mathbb{P}(x^k_{t-l+1}\mid x^k_{t-l}, u_{t-l})\, \mathbb{P}(y^k_{t-l}\mid x^k_{t-l}, u_{t-l}),$$
$$\mathbb{P}^{g^i,\hat g^{-i}}(u^i_{t-l}\mid \cdot) = \prod_{(i,j)\in\mathcal{N}^i} \mathbb{1}\{u^{i,j}_{t-l} = \gamma^{i,j}_{t-l}(x^{i,j}_{t-l-d+1:t-l})\} = \prod_{(i,j)\in\mathcal{N}^i} \mathbb{1}\{u^{i,j}_{t-l} = \phi^{i,j}_{t-l,l}(x^i_{t-d+1:t-l})\},$$
$$\mathbb{P}^{g^i,\hat g^{-i}}(\gamma^i_{t-l}\mid \cdot) = g^i_{t-l}(\gamma^i_{t-l} \mid y_{1:t-l-1}, u_{1:t-l-1}, x^i_{1:t-d-(l-1)}, \gamma^i_{1:t-l-1}), \tag{12}$$
respectively. The last term satisfies
$$\mathbb{P}^{g^i,\hat g^{-i}}(x_\tau \mid h^{*i}_\tau) = \prod_{k\in\mathcal{I}} \mathbb{P}(x^k_\tau \mid x^k_{\tau-1}, u_{\tau-1}).$$
Substituting (11)–(12) into (10), we obtain
$$\mathbb{P}^{g^i,\hat g^{-i}}(x^i_{\tau:t}\mid h^i_t, x^{-i}_{\tau-1:t}) = \frac{F^i_t(x^i_{\tau:t}, y^i_{\tau:t-1}, u_{\tau-1:t-1}, x^i_{\tau-1}, \phi^i_t)}{\sum_{\tilde x^i_{\tau:t}} F^i_t(\tilde x^i_{\tau:t}, y^i_{\tau:t-1}, u_{\tau-1:t-1}, x^i_{\tau-1}, \phi^i_t)},$$
where
$$F^i_t(x^i_{\tau:t}, y^i_{\tau:t-1}, u_{\tau-1:t-1}, x^i_{\tau-1}, \phi^i_t) := \mathbb{P}(x^i_\tau\mid x^i_{\tau-1}, u_{\tau-1}) \prod_{l=1}^{d-1}\Big[\mathbb{P}(x^i_{t-l+1}\mid x^i_{t-l}, u_{t-l})\,\mathbb{P}(y^i_{t-l}\mid x^i_{t-l}, u_{t-l}) \prod_{(i,j)\in\mathcal{N}^i}\mathbb{1}\{u^{i,j}_{t-l}=\phi^{i,j}_{t-l,l}(x^i_{t-d+1:t-l})\}\Big].$$
Therefore we have proved that
$$\mathbb{P}^g(x^i_{t-d+1:t}\mid h^i_t) = P^i_t(x^i_{t-d+1:t}\mid y^i_{t-d+1:t-1}, u_{t-d:t-1}, x^i_{t-d}, \phi^i_t) := \frac{F^i_t(x^i_{t-d+1:t}, y^i_{t-d+1:t-1}, u_{t-d:t-1}, x^i_{t-d}, \phi^i_t)}{\sum_{\tilde x^i_{t-d+1:t}} F^i_t(\tilde x^i_{t-d+1:t}, y^i_{t-d+1:t-1}, u_{t-d:t-1}, x^i_{t-d}, \phi^i_t)},$$
where $P^i_t$ does not depend on $g$.

E. Proof of Lemma 6
Let $\tilde g^i$ denote coordinator $i$'s behavioral coordination strategy. Because of Lemma 3 we have
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t \mid h^i_t, \gamma^i_t) = \mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t \mid h_t, x^i_{1:t-d}, \gamma^i_{1:t}) = \mathbb{P}^{\tilde g^i}(x^i_{t-d+1:t}\mid h_t, x^i_{1:t-d}, \gamma^i_{1:t}) \prod_{k\ne i} \mathbb{P}^{g^k}(x^k_{t-d+1:t}, \gamma^k_t \mid h_t).$$
We know that $\Gamma^i_t$ and $X^i_{t-d+1:t}$ are conditionally independent given $H^i_t$, since $\Gamma^i_t$ is chosen as a randomized function of $H^i_t$ at a time when $X^i_{t-d+1:t}$ are already realized. Therefore,
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x^i_{t-d+1:t}\mid h_t, x^i_{1:t-d}, \gamma^i_{1:t}) = \mathbb{P}^{\tilde g^i, g^{-i}}(x^i_{t-d+1:t}\mid h_t, x^i_{1:t-d}, \gamma^i_{1:t-1}) = P^i_t(x^i_{t-d+1:t}\mid y^i_{t-d+1:t-1}, u_{t-d:t-1}, x^i_{t-d}, \phi^i_t),$$
where $P^i_t$ is the belief function defined in Eq. (1). We conclude that
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t \mid h_t, x^i_{1:t-d}, \gamma^i_{1:t-1}) = F^i_t(x_{t-d+1:t}, \gamma^{-i}_t \mid h_t, x^i_{t-d}, \phi^i_t; g^{-i}) \tag{13}$$
for some function $F^i_t$ that does not depend on $\tilde g^i$.

Consider the reward of coordinator $i$. By the law of iterated expectations we can write
$$J^i(\tilde g^i, g^{-i}) = \mathbb{E}^{\tilde g^i, g^{-i}}\left[\sum_{t\in\mathcal{T}} \mathbb{E}^{\tilde g^i, g^{-i}}[r^i_t(X_t, U_t)\mid H^i_t, \Gamma^i_t]\right].$$
For each term we have
$$\mathbb{E}^{\tilde g^i, g^{-i}}[r^i_t(X_t, U_t)\mid h^i_t, \gamma^i_t] = \sum_{\tilde x_{t-d+1:t}} \sum_{\tilde\gamma^{-i}_t} r^i_t\big(\tilde x_t, (\gamma^i_t(\tilde x^i_{t-d+1:t}), \tilde\gamma^{-i}_t(\tilde x^{-i}_{t-d+1:t}))\big)\, F^i_t(\tilde x_{t-d+1:t}, \tilde\gamma^{-i}_t\mid h_t, x^i_{t-d}, \phi^i_t; g^{-i}) =: \hat r^i_t(h_t, x^i_{t-d}, \phi^i_t, \gamma^i_t; g^{-i}),$$
where $F^i_t$ is the belief function described in (13), and $\hat r^i_t$ is a function that does not depend on $\tilde g^i$.

We claim that $(H_t, X^i_{t-d}, \Phi^i_t)$ is a controlled Markov process, controlled by coordinator $i$'s prescriptions, when the other coordinators' strategies are fixed. We need to prove that
$$\mathbb{P}^{\tilde g^i, g^{-i}}(h_{t+1}, x^i_{t-d+1}, \phi^i_{t+1}\mid h_{1:t}, x^i_{1:t-d}, \phi^i_{1:t}, \gamma^i_{1:t}) = G^i_t(h_{t+1}, x^i_{t-d+1}, \phi^i_{t+1}\mid h_t, x^i_{t-d}, \phi^i_t, \gamma^i_t)$$
for some function $G^i_t$ independent of $\tilde g^i$. We know that $H_{t+1} = (H_t, Y_t, U_t)$ and
$$Y^k_t = \ell^k_t(X^k_t, U_t, W^{k,Y}_t)\ \ \forall k\in\mathcal{I}, \qquad U^{k,j}_t = \Gamma^{k,j}_t(X^{k,j}_{t-d+1:t})\ \ \forall (k,j)\in\mathcal{N},$$
$$\Phi^i_{t+1} = (\Phi^{i,j}_{t+1-s,s})_{(i,j)\in\mathcal{N}^i,\, 0\le s\le d-1}, \qquad \Phi^{i,j}_{t,0} = \Gamma^{i,j}_t(X^{i,j}_{t-d+1}, \cdot)\ \ \forall (i,j)\in\mathcal{N}^i, \qquad \Phi^{i,j}_{t+1-s,s} = \Phi^{i,j}_{t+1-s,s-1}(X^{i,j}_{t-d+1}, \cdot)\ \ \forall (i,j)\in\mathcal{N}^i,\ s\ge 1,$$
hence $(H_{t+1}, X^i_{t-d+1}, \Phi^i_{t+1})$ is a function of $(H_t, X_{t-d+1:t}, \Gamma_t, \Phi^i_t)$ and $W^Y_t$. As $W^Y_t$ is a primitive random vector independent of $(H_{1:t}, X^i_{1:t-d}, \Phi^i_{1:t}, \Gamma^i_{1:t})$, it suffices to prove that
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid h_{1:t}, x^i_{1:t-d}, \phi^i_{1:t}, \gamma^i_{1:t}) = \hat G^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid h_t, x^i_{t-d}, \phi^i_t, \gamma^i_t)$$
for some function $\hat G^i_t$ independent of $\tilde g^i$. Since $(H_{1:t}, X^i_{1:t-d}, \Phi^i_{1:t}, \Gamma^i_{1:t})$ is a function of $(H^i_t, \Gamma^i_t)$, applying the smoothing property of conditional expectations to both sides of (13) we obtain
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid h_{1:t}, x^i_{1:t-d}, \phi^i_{1:t}, \gamma^i_{1:t}) = F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid h_t, x^i_{t-d}, \phi^i_t; g^{-i}).$$
Hence, we conclude that coordinator $i$ faces a Markov decision problem where the state process is $(H_t, X^i_{t-d}, \Phi^i_t)$, the control action is $\Gamma^i_t$, and the total reward is
$$\mathbb{E}^{\tilde g^i, g^{-i}}\left[\sum_{t\in\mathcal{T}} \hat r^i_t(H_t, X^i_{t-d}, \Phi^i_t, \Gamma^i_t; g^{-i})\right].$$
By standard MDP theory, coordinator $i$ can form a best response by choosing $\Gamma^i_t$ based on $(H_t, X^i_{t-d}, \Phi^i_t)$.

F. Proof of Theorem 1
The idea of the proof is to apply Kakutani's fixed-point theorem to a special best-response correspondence defined through Bellman equations.

Define $\Xi^i_t \subset \mathcal{H}_t \times \mathcal{S}^i_t$ to be the set of admissible $(h_t, s^i_t)$'s, i.e., $(h_t, s^i_t)$'s with strictly positive probability under at least one strategy profile of the coordinators. For $\epsilon \ge 0$, define $\mathcal{R}^{\epsilon,i}$ to be the set of SPIB strategy profiles for coordinator $i$ where each prescription has probability at least $\epsilon$ of being chosen at any information set. Specifically, it suffices to consider the prescription choices for $(h_t, s^i_t) \in \Xi^i_t$ for each $t\in\mathcal{T}$, and we can write
$$\mathcal{R}^{\epsilon,i} = \prod_{t\in\mathcal{T}} \prod_{\xi^i_t\in\Xi^i_t} \Delta^\epsilon(\Gamma^i_t), \qquad \text{where } \Delta^\epsilon(\Gamma^i_t) = \{\eta\in\Delta(\Gamma^i_t) : \eta(\gamma^i_t)\ge\epsilon\ \forall\gamma^i_t\in\Gamma^i_t\}.$$
We also define $\mathcal{R}^\epsilon = \prod_{i\in\mathcal{I}} \mathcal{R}^{\epsilon,i}$; $\mathcal{R}^0$ is then the set of all SPIB strategy profiles.

Recall that in the proof of Lemma 6 we have shown that, fixing a behavioral coordination strategy profile $g^{-i}$, coordinator $i$ faces an MDP with state $\Xi^i_t := (H_t, S^i_t)$, control action $\Gamma^i_t$, and total reward $\mathbb{E}[\sum_{t\in\mathcal{T}} \hat r^i_t(H_t, S^i_t, \Gamma^i_t; g^{-i})]$, where $\hat r^i_t$ is the function defined in Appendix E. With some abuse of notation, let $\hat r^i_t(\xi^i_t, \gamma^i_t; \rho^{-i})$ denote the instantaneous reward when all coordinators except $i$ play the SPIB strategy profile $\rho^{-i}$.

Hence we can define a subset of the best-response correspondence through the following construction. For each $\xi^i_t\in\Xi^i_t$, define the correspondence $BR^{\epsilon,i}_t[\xi^i_t] : \mathcal{R}^{\epsilon,-i} \rightrightarrows \Delta^\epsilon(\Gamma^i_t)$ sequentially through
$$Q^{\epsilon,i}_T(\xi^i_T, \gamma^i_T; \rho^{-i}) := \hat r^i_T(\xi^i_T, \gamma^i_T; \rho^{-i}),$$
and, for each $t\in\mathcal{T}$ and each $\xi^i_t\in\Xi^i_t$,
$$BR^{\epsilon,i}_t[\xi^i_t](\rho^{-i}) := \arg\max_{\eta\in\Delta^\epsilon(\Gamma^i_t)} \sum_{\gamma^i_t} \eta(\gamma^i_t)\, Q^{\epsilon,i}_t(\xi^i_t, \gamma^i_t; \rho^{-i}),$$
$$V^{\epsilon,i}_t(\xi^i_t; \rho^{-i}) := \max_{\eta\in\Delta^\epsilon(\Gamma^i_t)} \sum_{\gamma^i_t} \eta(\gamma^i_t)\, Q^{\epsilon,i}_t(\xi^i_t, \gamma^i_t; \rho^{-i}),$$
$$Q^{\epsilon,i}_{t-1}(\xi^i_{t-1}, \gamma^i_{t-1}; \rho^{-i}) := \hat r^i_{t-1}(\xi^i_{t-1}, \gamma^i_{t-1}; \rho^{-i}) + \sum_{\xi^i_t} V^{\epsilon,i}_t(\xi^i_t; \rho^{-i})\, \mathbb{P}^{\rho^{-i}}(\xi^i_t\mid \xi^i_{t-1}, \gamma^i_{t-1}).$$
Define $BR^\epsilon : \mathcal{R}^\epsilon \rightrightarrows \mathcal{R}^\epsilon$ by
$$BR^\epsilon(\rho) = \{\tilde g\in\mathcal{R}^\epsilon : \tilde g^i_t(\xi^i_t)\in BR^{\epsilon,i}_t[\xi^i_t](\rho^{-i})\ \forall \xi^i_t\in\Xi^i_t, \forall i\in\mathcal{I}\} = \prod_{i\in\mathcal{I}}\prod_{t\in\mathcal{T}}\prod_{\xi^i_t\in\Xi^i_t} BR^{\epsilon,i}_t[\xi^i_t](\rho^{-i}).$$

Claim:
(a) $\hat r^i_t(\xi^i_t, \gamma^i_t; \rho^{-i})$ is continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$ for all $t\in\mathcal{T}$ and all $\xi^i_t\in\Xi^i_t$, $\gamma^i_t\in\Gamma^i_t$;
(b) $\mathbb{P}^{\rho^{-i}}(\xi^i_{t+1}\mid \xi^i_t, \gamma^i_t)$ is continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$ for all $t\in\mathcal{T}\backslash\{T\}$ and all $\xi^i_{t+1}\in\Xi^i_{t+1}$, $\xi^i_t\in\Xi^i_t$, $\gamma^i_t\in\Gamma^i_t$.

Given the claim, we prove by induction that $Q^{\epsilon,i}_t(\xi^i_t, \gamma^i_t; \cdot)$ is continuous on $\mathcal{R}^{\epsilon,-i}$ for each $\xi^i_t\in\Xi^i_t$ and $\gamma^i_t\in\Gamma^i_t$.

Induction Base: $Q^{\epsilon,i}_T(\xi^i_T, \gamma^i_T; \cdot)$ is continuous on $\mathcal{R}^{\epsilon,-i}$, since $\hat r^i_T(\xi^i_T, \gamma^i_T; \rho^{-i})$ is continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$ for all $\xi^i_T\in\Xi^i_T$ and all $\gamma^i_T\in\Gamma^i_T$.

Induction Step: Suppose that the induction hypothesis is true for $t$. Then $V^{\epsilon,i}_t(\xi^i_t; \cdot)$ is continuous on $\mathcal{R}^{\epsilon,-i}$ due to Berge's Maximum Theorem. Then, for all $\xi^i_{t-1}\in\Xi^i_{t-1}$ and $\gamma^i_{t-1}\in\Gamma^i_{t-1}$, $Q^{\epsilon,i}_{t-1}(\xi^i_{t-1}, \gamma^i_{t-1}; \cdot)$ is continuous on $\mathcal{R}^{\epsilon,-i}$, since $\hat r^i_{t-1}(\xi^i_{t-1}, \gamma^i_{t-1}; \rho^{-i})$ is continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$ and the transition probability $\mathbb{P}^{\rho^{-i}}(\xi^i_t\mid\xi^i_{t-1}, \gamma^i_{t-1})$ is also continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$.

Because of Berge's Maximum Theorem, we conclude that $BR^{\epsilon,i}_t[\xi^i_t]$ is upper hemicontinuous on $\mathcal{R}^{\epsilon,-i}$ for each $\xi^i_t\in\Xi^i_t$. $BR^{\epsilon,i}_t[\xi^i_t](\rho^{-i})$ is also non-empty and convex for each $\rho^{-i}\in\mathcal{R}^{\epsilon,-i}$, since it is the solution set of a linear program. As a product of compact-valued upper hemicontinuous correspondences, $BR^\epsilon$ is upper hemicontinuous. Furthermore, $BR^\epsilon(\rho)$ is non-empty and convex for each $\rho\in\mathcal{R}^\epsilon$. By Kakutani's fixed-point theorem, $BR^\epsilon$ has a fixed point.

Let $\epsilon_n \searrow 0$ and let $\rho^{(n)}\in\mathcal{R}^{\epsilon_n}$ be a fixed point of $BR^{\epsilon_n}$. Then for each $i\in\mathcal{I}$ we have
$$\rho^{(n),i}\in\arg\max_{\rho^i\in\mathcal{R}^{\epsilon_n,i}} J^i(\rho^i, \rho^{(n),-i}), \qquad \text{where } J^i(\rho) = \mathbb{E}^\rho\left[\sum_{t\in\mathcal{T}} r^i_t(X_t, U_t)\right] = \mathbb{E}^{\rho}\left[\sum_{t\in\mathcal{T}} \hat r^i_t(S^i_t, \Gamma^i_t; \rho^{-i})\right].$$
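The role of $\Delta^\epsilon(\Gamma^i_t)$ above can be made concrete in a single stage. For finitely many prescriptions with values $Q(\gamma)$, the $\epsilon$-constrained maximization over $\Delta^\epsilon$ is a linear program over a truncated simplex, and an optimum has a simple form: every prescription receives the floor $\epsilon$ and the leftover mass goes to an argmax. A minimal sketch (the helper name is ours, not the paper's):

```python
def eps_best_response(Q, eps):
    """Maximize sum_g eta[g]*Q[g] over {eta in the simplex : eta[g] >= eps}.

    As a linear program over a truncated simplex, an optimum puts the floor
    eps on every prescription and the remaining 1 - n*eps mass on an argmax
    prescription.
    """
    n = len(Q)
    assert 0.0 <= eps * n <= 1.0
    eta = [eps] * n
    eta[max(range(n), key=lambda g: Q[g])] += 1.0 - n * eps
    return eta

# As eps decreases to 0, the constrained maximizer converges to an
# unconstrained best response -- the limiting step in the proof of Theorem 1.
Q = [1.0, 3.0, 2.0]
for eps in (0.25, 0.1, 0.0):
    eta = eps_best_response(Q, eps)
    assert abs(sum(eta) - 1.0) < 1e-9 and min(eta) >= eps
    assert max(eta) == eta[1]   # all the surplus mass sits on the argmax
```

The floor $\eta(\gamma) \ge \epsilon$ is what keeps every information set reached with positive probability, which is why the beliefs (and hence the correspondence) behave continuously in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$.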
Let $\rho^{(\infty)}\in\mathcal{R}^0$ be the limit of some subsequence of $(\rho^{(n)})_{n\in\mathbb{N}}$. Since $J^i(\cdot)$ is continuous on $\mathcal{R}^0$ and $\epsilon\mapsto\mathcal{R}^{\epsilon,i}$ is a continuous correspondence with compact, non-empty values (for small enough $\epsilon$), by Berge's Maximum Theorem we conclude that, for each $i$,
$$\rho^{(\infty),i}\in\arg\max_{\rho^i\in\mathcal{R}^{0,i}} J^i(\rho^i, \rho^{(\infty),-i}),$$
i.e., $\rho^{(\infty),i}$ is one of the optimal strategies among SPIB strategies in response to $\rho^{(\infty),-i}$. Combining this with Lemma 6, which states that there always exists a best-response strategy that is an SPIB strategy, we conclude that $\rho^{(\infty)}$ forms a CNE, proving the result.

Proof of Claim. We first notice that, by the proof of Lemma 6, both $\hat r^i_t(\xi^i_t, \gamma^i_t; \rho^{-i})$ and $\mathbb{P}^{\rho^{-i}}(\xi^i_{t+1}\mid\xi^i_t,\gamma^i_t)$ are linear functions of $F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid \xi^i_t; \rho^{-i})$ (defined in (13)). We have
$$F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid \xi^i_t; \rho^{-i}) = \mathbb{P}^{\rho^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid\xi^i_t) = \frac{\mathbb{P}^{\hat\rho^i,\rho^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t, \xi^i_t)}{\mathbb{P}^{\hat\rho^i,\rho^{-i}}(\xi^i_t)},$$
where $\hat\rho^i\in\mathcal{R}^{\epsilon,i}$ is a fixed, arbitrary SPIB strategy. We know that both $\mathbb{P}^{\hat\rho^i,\rho^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t, \xi^i_t)$ and $\mathbb{P}^{\hat\rho^i,\rho^{-i}}(\xi^i_t)$ are sums of products of components of $\rho^{-i}$ and $\hat\rho^i$, hence both are continuous in $\rho^{-i}$. Furthermore, we have $\mathbb{P}^{\hat\rho^i,\rho^{-i}}(\xi^i_t) > 0$ for all $\rho^{-i}\in\mathcal{R}^{\epsilon,-i}$, since $\xi^i_t\in\Xi^i_t$ has strictly positive probability under some strategy profile and $(\hat\rho^i, \rho^{-i})$ is a strategy profile that chooses strictly mixed prescriptions. Therefore $F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid\xi^i_t;\rho^{-i})$ is continuous in $\rho^{-i}$ on $\mathcal{R}^{\epsilon,-i}$.

G. Proof of Lemma 7
We will prove a stronger result which we need in the proofof Proposition 3.
Lemma 10.
Let $(\lambda^{*k}, \psi^*)$ be a CIB strategy such that $\psi^{*,k}$ is consistent with $\lambda^{*k}$. Let $g^{*k}$ be the behavioral strategy profile generated from $(\lambda^{*k}, \psi^*)$. Let $\pi^k_t$ represent the belief on $S^k_t$ generated by $\psi^*$ at time $t$ based on $h_t$. Let $t < \tau$. Consider a fixed $h_\tau\in\mathcal{H}_\tau$ and some $\tilde g^k_{1:t-1}$ (not necessarily equal to $g^{*k}_{1:t-1}$). Assume that $h_\tau$ is admissible under $(\tilde g^k_{1:t-1}, g^{*k}_{t:\tau-1})$. Suppose that
$$\mathbb{P}^{\tilde g^k_{1:t-1}}(s^k_t, x^k_{t-d+1:t}\mid h_t) = \pi^k_t(s^k_t)\, P^k_t(x^k_{t-d+1:t}\mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, s^k_t) \qquad \forall s^k_t\in\mathcal{S}^k_t,\ \forall x^k_{t-d+1:t}\in\mathcal{X}^k_{t-d+1:t}. \tag{14}$$
Then
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_{t:\tau-1}}(s^k_\tau, x^k_{\tau-d+1:\tau}\mid h_\tau) = \pi^k_\tau(s^k_\tau)\, P^k_\tau(x^k_{\tau-d+1:\tau}\mid y^k_{\tau-d+1:\tau-1}, u_{\tau-d:\tau-1}, s^k_\tau) \qquad \forall s^k_\tau\in\mathcal{S}^k_\tau,\ \forall x^k_{\tau-d+1:\tau}\in\mathcal{X}^k_{\tau-d+1:\tau}.$$

The assertion of Lemma 7 follows from Lemma 10 and the fact that (14) is true for $t = 1$.

Proof of Lemma 10.
We only need to prove the result for $\tau = t+1$. Since $h_{t+1}$ is admissible under $(\tilde g^k_{1:t-1}, g^{*k}_t)$, we have
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t, \hat g^{-k}_{1:t}}(h_{t+1}) > 0, \tag{15}$$
where $\hat g^{-k}_{1:t}$ is the open-loop strategy profile under which all coordinators except $k$ choose prescriptions that generate the actions $u^{-k}_{1:t}$. From Lemma 3 we know that $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t, g^{-k}}(s^k_{t+1}\mid h_{t+1})$ is independent of $g^{-k}$. Therefore
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(s^k_{t+1}\mid h_{t+1}) = \frac{\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t, \hat g^{-k}_{1:t}}(s^k_{t+1}, y_t, u_t\mid h_t)}{\sum_{\tilde s^k_{t+1}} \mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t, \hat g^{-k}_{1:t}}(\tilde s^k_{t+1}, y_t, u_t\mid h_t)}, \tag{16}$$
and the denominator of (16) is non-zero due to (15). Writing $\mathbb{P}$ for $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t, \hat g^{-k}_{1:t}}$, we have
$$\mathbb{P}(s^k_{t+1}, y_t, u_t\mid h_t) = \sum_{\tilde s^k_t}\ \sum_{\tilde x^k_{t-d+1:t}}\ \sum_{\tilde x^{-k}_t}\ \sum_{\tilde\gamma^k_t:\,\tilde\gamma^k_t(\tilde x^k_{t-d+1:t})=u^k_t} \Big[\mathbb{P}(y^k_t\mid\tilde x^k_t, u_t)\,\mathbb{P}(y^{-k}_t\mid\tilde x^{-k}_t, u_t)\,\mathbb{1}\{s^k_{t+1}=\iota^k_t(\tilde s^k_t, \tilde x^k_{t-d+1}, \tilde\gamma^k_t)\}\,\lambda^{*k}_t(\tilde\gamma^k_t\mid b_t, \tilde s^k_t)\,\mathbb{P}(\tilde x^k_{t-d+1:t}, \tilde x^{-k}_t, \tilde s^k_t\mid h_t)\Big]$$
$$= \left(\sum_{\tilde x^{-k}_t}\mathbb{P}(y^{-k}_t\mid\tilde x^{-k}_t, u_t)\,\mathbb{P}(\tilde x^{-k}_t\mid h_t)\right) \times \sum_{\tilde s^k_t}\ \sum_{\tilde x^k_{t-d+1:t}}\ \sum_{\tilde\gamma^k_t:\,\tilde\gamma^k_t(\tilde x^k_{t-d+1:t})=u^k_t}\Big[\mathbb{P}(y^k_t\mid\tilde x^k_t, u_t)\,\mathbb{1}\{s^k_{t+1}=\iota^k_t(\tilde s^k_t, \tilde x^k_{t-d+1}, \tilde\gamma^k_t)\}\,\lambda^{*k}_t(\tilde\gamma^k_t\mid b_t, \tilde s^k_t)\,\mathbb{P}(\tilde x^k_{t-d+1:t}, \tilde s^k_t\mid h_t)\Big], \tag{17}$$
where the second equality uses the conditional independence $\mathbb{P}(\tilde x^k_{t-d+1:t}, \tilde x^{-k}_t, \tilde s^k_t\mid h_t) = \mathbb{P}(\tilde x^k_{t-d+1:t}, \tilde s^k_t\mid h_t)\,\mathbb{P}(\tilde x^{-k}_t\mid h_t)$ from Lemma 3. Using (16) and (17) together with the hypothesis (14), we obtain
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(s^k_{t+1}\mid h_{t+1}) = \frac{\Upsilon^k_t(b_t, y^k_t, u_t, s^k_{t+1})}{\sum_{\tilde s^k_{t+1}}\Upsilon^k_t(b_t, y^k_t, u_t, \tilde s^k_{t+1})},$$
where
$$\Upsilon^k_t(b_t, y^k_t, u_t, s^k_{t+1}) = \sum_{\tilde s^k_t}\ \sum_{\tilde x^k_{t-d+1:t}}\ \sum_{\tilde\gamma^k_t:\,\tilde\gamma^k_t(\tilde x^k_{t-d+1:t})=u^k_t}\Big[\mathbb{P}(y^k_t\mid\tilde x^k_t, u_t)\,\mathbb{1}\{s^k_{t+1}=\iota^k_t(\tilde s^k_t, \tilde x^k_{t-d+1}, \tilde\gamma^k_t)\}\,\lambda^{*k}_t(\tilde\gamma^k_t\mid b_t, \tilde s^k_t)\,P^k_t(\tilde x^k_{t-d+1:t}\mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^k_t)\,\pi^k_t(\tilde s^k_t)\Big].$$
Therefore, by the definition of consistency of $\psi^{*,k}$ with respect to $\lambda^{*k}$, we conclude that $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(s^k_{t+1}\mid h_{t+1}) = \pi^k_{t+1}(s^k_{t+1})$.

Now consider $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}, s^k_{t+1}\mid h_{t+1})$.
• If $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(s^k_{t+1}\mid h_{t+1}) = 0$, then we have $\pi^k_{t+1}(s^k_{t+1}) = 0$ and $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}, s^k_{t+1}\mid h_{t+1}) = 0$.
• If $\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(s^k_{t+1}\mid h_{t+1}) > 0$, then
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}, s^k_{t+1}\mid h_{t+1}) = \mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}\mid h_{t+1}, s^k_{t+1})\,\pi^k_{t+1}(s^k_{t+1}).$$
We have shown in Lemma 5 that
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}\mid h^k_{t+1}) = P^k_{t+1}(\tilde x^k_{t-d+2:t+1}\mid y^k_{t-d+2:t}, u_{t-d+1:t}, s^k_{t+1}),$$
and $(h_{t+1}, s^k_{t+1})$ is a function of $h^k_{t+1}$. By the law of iterated expectations we have
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}\mid h_{t+1}, s^k_{t+1}) = P^k_{t+1}(\tilde x^k_{t-d+2:t+1}\mid y^k_{t-d+2:t}, u_{t-d+1:t}, s^k_{t+1}).$$
We conclude that
$$\mathbb{P}^{\tilde g^k_{1:t-1}, g^{*k}_t}(\tilde x^k_{t-d+2:t+1}, s^k_{t+1}\mid h_{t+1}) = P^k_{t+1}(\tilde x^k_{t-d+2:t+1}\mid y^k_{t-d+2:t}, u_{t-d+1:t}, s^k_{t+1})\,\pi^k_{t+1}(s^k_{t+1})$$
for all $s^k_{t+1}\in\mathcal{S}^k_{t+1}$ and all $x^k_{t-d+2:t+1}\in\mathcal{X}^k_{t-d+2:t+1}$.

H. Proof of Lemma 8
Let $g^{-i}$ denote the behavioral strategy profile of all coordinators other than $i$ generated from the CIB strategy profile $(\lambda^k, \psi^k)_{k\in\mathcal{I}\backslash\{i\}}$. Let $(h^i_t, \gamma^i_t)$ be admissible under $g^{-i}$. We have shown in the proof of Lemma 6 that
$$\mathbb{P}^{g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid h^i_t, \gamma^i_t) = P^i_t(x^i_{t-d+1:t}\mid y^i_{t-d+1:t-1}, u_{t-d:t-1}, x^i_{t-d}, \phi^i_t) \prod_{k\ne i} \mathbb{P}^{g^k}(x^k_{t-d+1:t}, \gamma^k_t\mid h_t), \tag{18}$$
where $P^i_t$ is the belief function defined in Eq. (1). Since all coordinators other than coordinator $i$ use the same belief generation system, we have $B^j_t = B^k_t$ for $j, k \ne i$. Denote $B_t = B^k_t$ for all $k\in\mathcal{I}\backslash\{i\}$, and let
$$b_t = \left(\big(\pi^{*,l}_t\big)_{l\in\mathcal{I}},\ y_{t-d+1:t-1},\ u_{t-d:t-1}\right)$$
be a realization of $B_t$. Also define $\psi^* = \psi^k$ for all $k\ne i$.

Consider $k\ne i$. Coordinator $k$'s strategy $g^k$ is a self-consistent CIB strategy. We also have $h_t$ admissible under $g^k$, since $(h^i_t, \gamma^i_t)$ is admissible under $g^{-i}$. Hence, applying Lemma 7, we have
$$\mathbb{P}^{g^k}(\tilde s^k_t, x^k_{t-d+1:t}\mid h_t) = \pi^{*,k}_t(\tilde s^k_t)\, P^k_t(x^k_{t-d+1:t}\mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^k_t).$$
Hence the second term on the right-hand side of (18) satisfies
$$\mathbb{P}^{g^k}(x^k_{t-d+1:t}, \gamma^k_t\mid h_t) = \sum_{\tilde s^k_t} \mathbb{P}^{g^k}(\tilde s^k_t, x^k_{t-d+1:t}, \gamma^k_t\mid h_t) = \sum_{\tilde s^k_t}\Big[\pi^{*,k}_t(\tilde s^k_t)\, P^k_t(x^k_{t-d+1:t}\mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^k_t)\,\lambda^k_t(\gamma^k_t\mid b_t, \tilde s^k_t)\Big], \tag{19}$$
where $P^k_t$ is the belief function defined in Eq. (1). From (18) and (19) we conclude that
$$\mathbb{P}^{g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid h^i_t, \gamma^i_t) = F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid b_t, s^i_t) \tag{20}$$
for some function $F^i_t$, for all $(h^i_t, \gamma^i_t)$ admissible under $g^{-i}$.

Consider the total reward of coordinator $i$. By the law of iterated expectations we can write
$$J^i(\tilde g^i, g^{-i}) = \mathbb{E}^{\tilde g^i, g^{-i}}\left[\sum_{t\in\mathcal{T}} \mathbb{E}^{g^{-i}}[r^i_t(X_t, U_t)\mid H^i_t, \Gamma^i_t]\right].$$
For $(h^i_t, \gamma^i_t)$ admissible under $g^{-i}$,
$$\mathbb{E}^{g^{-i}}[r^i_t(X_t, U_t)\mid h^i_t, \gamma^i_t] = \sum_{\tilde x_{t-d+1:t}}\sum_{\tilde\gamma^{-i}_t} r^i_t\big(\tilde x_t, (\gamma^i_t(\tilde x^i_{t-d+1:t}), \tilde\gamma^{-i}_t(\tilde x^{-i}_{t-d+1:t}))\big)\, F^i_t(\tilde x_{t-d+1:t}, \tilde\gamma^{-i}_t\mid b_t, s^i_t) = \hat r^i_t(b_t, s^i_t, \gamma^i_t),$$
for some function $\hat r^i_t$ that depends on $g^{-i}$ (specifically, on $\lambda^{-i}_t$) but not on $\tilde g^i$.

We claim that $(B_t, S^i_t)$ is a controlled Markov process controlled by coordinator $i$'s prescriptions, given that the other coordinators use the strategy profile $g^{-i}$. Let $\tilde g^i$ denote an arbitrary strategy for coordinator $i$ (not necessarily a CIB strategy). We need to prove that
$$\mathbb{P}^{\tilde g^i, g^{-i}}(b_{t+1}, s^i_{t+1}\mid b_{1:t}, s^i_{1:t}, \gamma^i_{1:t}) = \Xi^i_t(b_{t+1}, s^i_{t+1}\mid b_t, s^i_t, \gamma^i_t) \qquad \forall (b_{1:t}, s^i_{1:t}, \gamma^i_{1:t}) \text{ s.t. } \mathbb{P}^{\tilde g^i, g^{-i}}(b_{1:t}, s^i_{1:t}, \gamma^i_{1:t}) > 0,$$
for some function $\Xi^i_t$ independent of $\tilde g^i$. We know that $B_{t+1} = (\Pi_{t+1}, Y_{t-d+2:t}, U_{t-d+1:t})$ and
$$\Pi_{t+1} = \psi^*_t(B_t, Y_t, U_t), \qquad Y^k_t = \ell^k_t(X^k_t, U_t, W^{k,Y}_t)\ \ \forall k\in\mathcal{I}, \qquad U^{k,j}_t = \Gamma^{k,j}_t(X^{k,j}_{t-d+1:t})\ \ \forall (k,j)\in\mathcal{N}, \qquad S^i_{t+1} = \iota^i_t(S^i_t, X^i_{t-d+1}, \Gamma^i_t).$$
Hence $(B_{t+1}, S^i_{t+1})$ is a fixed function of $(B_t, S^i_t, X_{t-d+1:t}, \Gamma_t, W^Y_t)$, where $W^Y_t$ is a primitive random vector independent of $(B_{1:t}, S^i_{1:t}, \Gamma^i_{1:t}, X_{t-d+1:t})$. Therefore, it suffices to prove that
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid b_{1:t}, s^i_{1:t}, \gamma^i_{1:t}) = \hat\Xi^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid b_t, s^i_t, \gamma^i_t)$$
for some function $\hat\Xi^i_t$ independent of $\tilde g^i$. $(B_{1:t}, S^i_{1:t}, \Gamma^i_{1:t})$ is a function of $(H^i_t, \Gamma^i_t)$. Therefore, applying the smoothing property of conditional expectations to both sides of (20), we obtain
$$\mathbb{P}^{\tilde g^i, g^{-i}}(x_{t-d+1:t}, \gamma^{-i}_t\mid b_{1:t}, s^i_{1:t}, \gamma^i_{1:t}) = F^i_t(x_{t-d+1:t}, \gamma^{-i}_t\mid b_t, s^i_t),$$
where $F^i_t$, as defined in (20), is independent of $\tilde g^i$. We conclude that coordinator $i$ faces a Markov decision problem where the state process is $(B_t, S^i_t)$, the control action is $\Gamma^i_t$, and the total reward is
$$\mathbb{E}\left[\sum_{t\in\mathcal{T}} \hat r^i_t(B_t, S^i_t, \Gamma^i_t)\right].$$
By standard MDP theory, coordinator $i$ can form a best response by choosing $\Gamma^i_t$ as a function of $(B_t, S^i_t)$.

I. Proof of Theorem 2
Let $(\lambda^*, \psi^*)$ be a pair that solves the dynamic program defined in the statement of the theorem. Let $g^{*k}$ denote the behavioral coordination strategy corresponding to $(\lambda^{*k}, \psi^*)$ for $k \in \mathcal{I}$. We only need to show the following: if the coordinators other than coordinator $i$ play $g^{*-i}$, then $g^{*i}$ is a best response to $g^{*-i}$.

Let $h_t \in \mathcal{H}_t$ be admissible under $g^{*-i}$. Then
$$P^{g^{*k}}(s^k_t, x^k_{t-d+1:t} \mid h_t) = \pi^k_t(s^k_t)\, P^k_t(x^k_{t-d+1:t} \mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, s^k_t) \qquad (21)$$
for all $k \neq i$ by Lemma 7, where $\pi^k_t$ is the belief generated by $\psi^*$ when $h_t$ occurs. By Lemma 5 we also have
$$P(\tilde s^i_t, \tilde x^i_{t-d+1:t} \mid h_t, s^i_t) = P^i_t(\tilde x^i_{t-d+1:t} \mid y^i_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^i_t). \qquad (22)$$
Combining (21) and (22), the belief for coordinator $i$ defined in the stage game according to Definition 13 satisfies
$$\begin{aligned}
\beta^i_t(\tilde z_t \mid s^i_t) &= \mathbf{1}\{\tilde s^i_t = s^i_t\} \prod_{k\neq i}\pi^k_t(\tilde s^k_t) \left(\prod_{k\in\mathcal{I}} P^k_t(\tilde x^k_{t-d+1:t} \mid y^k_{t-d+1:t-1}, u_{t-d:t-1}, \tilde s^k_t)\right) P(\tilde w^{Y}_t)\\
&= P(\tilde s^i_t, \tilde x^i_{t-d+1:t} \mid h_t, s^i_t) \prod_{k\neq i} P^{g^{*k}}(\tilde s^k_t, \tilde x^k_{t-d+1:t} \mid h_t)\, P(\tilde w^{Y}_t)\\
&= P^{g^{*-i}}(\tilde z_t \mid h_t, s^i_t)
\end{aligned}$$
for all $(h_t, s^i_t)$ admissible under $g^{*-i}$, i.e. the belief represents a true conditional distribution. Since $\beta^i_t(\cdot \mid s^i_t)$ is a fixed function of $(b_t, s^i_t)$, applying the smoothing property to both sides of the above equation yields
$$\beta^i_t(\tilde z_t \mid s^i_t) = P^{g^{*-i}}(\tilde z_t \mid b_t, s^i_t)$$
for all $(b_t, s^i_t)$ admissible under $g^{*-i}$.

Note that, in general, $P^{g^{-i}}(\tilde z_t \mid b_t, s^i_t)$ is different from $\beta^i_t(\tilde z_t \mid s^i_t)$. Since $B_t$ is a compression of the common information based on a predetermined update rule $\psi$, which may or may not be consistent with the strategy actually played, $B_t$ may not represent the true belief. $P^{g^{-i}}(\tilde z_t \mid b_t, s^i_t)$ is the belief an agent infers from the event $\{B_t = b_t, S^i_t = s^i_t\}$: the agent knows that $b_t$ might not contain the true belief, but it is useful anyway in inferring the true state. In contrast, $\beta^i_t(\tilde z_t \mid s^i_t)$ is a conditional distribution computed with $b_t$, pretending that $b_t$ contains the true belief.

Then the interim expected utility considered in the definition of IBNE correspondences (Definition 14) can be written as
$$\sum_{\tilde z_t, \tilde\gamma_t} \eta(\tilde\gamma^i_t)\, Q^i_t(\tilde z_t, \tilde\gamma_t)\, \beta^i_t(\tilde z_t \mid s^i_t) \prod_{k\neq i}\lambda^{*k}_t(\tilde\gamma^k_t \mid b_t, \tilde s^k_t) = \sum_{\tilde\gamma^i_t} \eta(\tilde\gamma^i_t)\, \mathbb{E}^{g^{*-i}}_t[Q^i_t(Z_t, \Gamma_t) \mid b_t, s^i_t, \tilde\gamma^i_t]$$
for all $(b_t, s^i_t)$ admissible under $g^{*-i}$. The condition of Theorem 2 then implies
$$\lambda^{*i}_t(b_t, s^i_t) \in \arg\max_{\eta \in \Delta(\Gamma^i_t)} \sum_{\tilde\gamma^i_t} \eta(\tilde\gamma^i_t)\, \mathbb{E}^{g^{*-i}}[r^i_t(X_t, U_t) + V^i_{t+1}(B_{t+1}, S^i_{t+1}) \mid b_t, s^i_t, \tilde\gamma^i_t]; \qquad (23)$$
$$V^i_t(b_t, s^i_t) = \sum_{\tilde\gamma^i_t} \lambda^{*i}_t(\tilde\gamma^i_t \mid b_t, s^i_t)\, \mathbb{E}^{g^{*-i}}_t[r^i_t(X_t, U_t) + V^i_{t+1}(B_{t+1}, S^i_{t+1}) \mid b_t, s^i_t, \tilde\gamma^i_t] \qquad (24)$$
for all $(b_t, s^i_t)$ admissible under $g^{*-i}$.

Recall that in the proof of Lemma 8 we have already established that, fixing $(\lambda^{*-i}, \psi^*)$, $(B_t, S^i_t)$ is a controlled Markov process controlled by $\Gamma^i_t$. Hence (23) and (24) show that $\lambda^{*i}$ is a dynamic programming solution of the MDP with instantaneous reward $r^i_t(B_t, S^i_t, \Gamma^i_t) := \mathbb{E}^{g^{*-i}}[r^i_t(X_t, U_t) \mid B_t, S^i_t, \Gamma^i_t]$. Therefore, $\lambda^{*i}$ maximizes
$$\mathbb{E}^{\lambda^i, \lambda^{*-i}}\left[\sum_{t\in\mathcal{T}} r^i_t(B_t, S^i_t, \Gamma^i_t)\right]$$
over all $\lambda^i = (\lambda^i_t)_{t\in\mathcal{T}}$, $\lambda^i_t: \mathcal{B}_t \times \mathcal{S}^i_t \to \Delta(\Gamma^i_t)$. Notice that for any $\lambda^i$, if $g^i$ is the behavioral coordination strategy corresponding to the CIB strategy $(\lambda^i, \psi^*)$, then by the law of iterated expectations
$$\mathbb{E}^{\lambda^i, \lambda^{*-i}}\left[\sum_{t\in\mathcal{T}} r^i_t(B_t, S^i_t, \Gamma^i_t)\right] = \mathbb{E}^{g^i, g^{*-i}}\left[\sum_{t\in\mathcal{T}} r^i_t(X_t, U_t)\right].$$
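As a concrete illustration of this reduction, here is a minimal backward-induction sketch for a finite-horizon MDP over a compressed state. The state space (flattened pairs $(b_t, s^i_t)$), action space (prescriptions), kernels, and rewards below are randomly generated stand-ins, not objects from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 4          # horizon (illustrative)
n_states = 5   # compressed states: flattened pairs (b_t, s^i_t)  [toy size]
n_actions = 3  # prescriptions gamma^i_t                          [toy size]

# Stand-ins for the kernels induced by fixing (lambda^{*-i}, psi^*):
# P[a, s, s'] is the controlled transition kernel of (B_t, S^i_t);
# r_bar[s, a] is the compressed stage reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r_bar = rng.random((n_states, n_actions))

# Backward induction: V_{T+1} = 0 and
# V_t(s) = max_a [ r_bar(s, a) + sum_{s'} P[a, s, s'] V_{t+1}(s') ].
V = np.zeros(n_states)
greedy = []
for t in reversed(range(T)):
    Q = r_bar + (P @ V).T        # Q[s, a] = r_bar[s, a] + E[V_{t+1} | s, a]
    greedy.append(Q.argmax(axis=1))
    V = Q.max(axis=1)
greedy.reverse()                  # greedy[t][s] = optimal prescription index

def evaluate(policy):
    """Value of an arbitrary Markov policy by backward policy evaluation."""
    W = np.zeros(n_states)
    for t in reversed(range(T)):
        a = policy[t]
        W = np.array([r_bar[s, a[s]] + P[a[s], s] @ W for s in range(n_states)])
    return W
```

The greedy policy returned by the backward pass attains the value `V`, and no other Markov policy on the compressed state can exceed it, mirroring the best-response argument above.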
Hence we know that $g^{*i}$ maximizes $\mathbb{E}^{g^i, g^{*-i}}\left[\sum_{t\in\mathcal{T}} r^i_t(X_t, U_t)\right]$ over all $g^i$ generated from a CIB strategy with the belief generation system $\psi^*$. By the closedness property of CIB strategies (Lemma 8), we conclude that $g^{*i}$ is a best response to $g^{*-i}$ over all behavioral coordination strategies of coordinator $i$, proving the result.

J. Proof of Proposition 1
We will characterize all the Bayes–Nash equilibria of Example 3 in terms of individual players' behavioral strategies. Then we will show that none of the BNE correspond to a CIB-CNE.

Let $p = (p_1, p_2) \in [0,1]^2$ describe Alice's behavioral strategy: $p_1$ is the probability that Alice plays $U_A = -1$ given $X_A = -1$; $p_2$ is the probability that Alice plays $U_A = +1$ given $X_A = +1$. Let $q = (q_1, q_2) \in [0,1]^2$ denote Bob's behavioral strategy: $q_1$ is the probability that Bob plays $U_B = L$ when observing $U_A = -1$; $q_2$ is the probability that Bob plays $U_B = L$ when observing $U_A = +1$.

Claim: $p^* = \left(\frac{1}{3}, \frac{1}{3}\right)$, $q^* = \left(\frac{1}{3} + \varepsilon, \frac{1}{3} - \varepsilon\right)$ is the unique BNE of Example 3.

Given the claim, one can conclude that a CIB-CNE does not exist in this game. Suppose that $(\lambda^*, \psi^*)$ forms a CIB-CNE. Then by the definition of CIB strategies, at $t = 1$ the team of Alice chooses a prescription (which maps $X_A$ to $U_A$) based on no information. At $t = 3$, the team of Bob chooses a prescription (which is equivalent to an action since Bob has no state) based solely on $B_3$. Define the induced behavioral strategies of Alice and Bob through
$$p_1 = \lambda^{*A}(\mathrm{id} \mid \varnothing) + \lambda^{*A}(\mathrm{cp}_{-1} \mid \varnothing), \qquad p_2 = \lambda^{*A}(\mathrm{id} \mid \varnothing) + \lambda^{*A}(\mathrm{cp}_{+1} \mid \varnothing),$$
$$q_1 = \lambda^{*B}(L \mid b[-1]), \qquad q_2 = \lambda^{*B}(L \mid b[+1]),$$
where $b[u]$ is the CCI under belief generation system $\psi^*$ when $U_A = u$; $\mathrm{id}$ is the prescription that chooses $U_A = X_A$; $\mathrm{cp}_u$ is the prescription that chooses $U_A = u$ irrespective of $X_A$; $L$ is Bob's prescription that chooses $U_B = L$.

The consistency of $\psi^*$ with respect to $\lambda^*$ implies that
$$\Pi(-1) = \frac{p_1}{p_1 + 1 - p_2} \ \text{ if } p \neq (0,1) \text{ and } U_A = -1, \qquad \Pi(+1) = \frac{p_2}{p_2 + 1 - p_1} \ \text{ if } p \neq (1,0) \text{ and } U_A = +1.$$
In particular, when $p_1 = p_2$, the value of $\Pi(U_A)$ is the same for either realization of $U_A$. If a CIB-CNE induces the behavioral strategy $p^* = \left(\frac{1}{3}, \frac{1}{3}\right)$, then the CIB belief $\Pi \in \Delta(\mathcal{X})$ will be the same for both $U_A = +1$ and $U_A = -1$ under any consistent belief generation system $\psi^*$. Then $B_3 = (\Pi, U_2)$ will be the same for both $U_A = +1$ and $U_A = -1$, since $U_2$ only takes one value. Hence Bob's induced stage behavioral strategy $q$ should satisfy $q_1 = q_2$. However, $q^* = \left(\frac{1}{3} + \varepsilon, \frac{1}{3} - \varepsilon\right)$ has $q^*_1 \neq q^*_2$, hence $(p^*, q^*)$ cannot be induced from any CIB-CNE. Since the induced behavioral strategy of any CIB-CNE must form a BNE in the game among individuals, we conclude that a CIB-CNE does not exist in Example 3.

Proof of Claim:
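Before the analytical derivation, the claimed equilibrium can be sanity-checked numerically. The payoff function below is a reconstruction consistent with the expressions used in this proof (and hence an assumption), evaluated at the illustrative value $\varepsilon = 0.1 < 1/3$:

```python
# Sanity check of the claimed BNE of Example 3, under the reconstructed
# payoff J(p, q) (an assumption) with eps = 0.1.
eps = 0.1  # any eps < 1/3

def J(p1, p2, q1, q2):
    # Alice's expected payoff (zero-sum game; Bob gets -J).
    return (0.5 * eps * (1 - p1 + p2) + 0.5 * (2 - p1 - p2)
            + 0.5 * (2 * p1 + p2 - 1) * q1 + 0.5 * (p1 + 2 * p2 - 1) * q2)

def J_star(p1, p2):
    # J is linear in q, so min over q in [0,1]^2 is attained at a vertex.
    return min(J(p1, p2, q1, q2) for q1 in (0.0, 1.0) for q2 in (0.0, 1.0))

# Alice's maximin strategy via grid search; the grid contains (1/3, 1/3).
grid = [i / 300 for i in range(301)]
val, p1s, p2s = max((J_star(p1, p2), p1, p2) for p1 in grid for p2 in grid)

# q* = (1/3 + eps, 1/3 - eps) makes Alice indifferent at the interior p*:
q1s, q2s = 1 / 3 + eps, 1 / 3 - eps
g1 = -0.5 * eps - 0.5 + q1s + 0.5 * q2s   # dJ/dp1 at q*
g2 = 0.5 * eps - 0.5 + 0.5 * q1s + q2s    # dJ/dp2 at q*
```

The grid search recovers $p^* = (1/3, 1/3)$ with value $\varepsilon/2 + 2/3$, and both partial derivatives vanish at $q^*$, consistent with the claim.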
Denote Alice's total expected payoff by $J(p,q)$. Expanding over the realizations of $X_A$ and collecting terms,
$$J(p,q) = \frac{1}{2}\varepsilon(1 - p_1 + p_2) + \frac{1}{2}(2 - p_1 - p_2) + \frac{1}{2}(2p_1 + p_2 - 1)q_1 + \frac{1}{2}(p_1 + 2p_2 - 1)q_2.$$
Since this is a zero-sum game, Alice's expected payoff at equilibrium can be characterized as $J^* = \max_p \min_q J(p,q)$, and Alice plays $p$ at some equilibrium if and only if $\min_q J(p,q) = J^*$. Define $J^*(p) = \min_q J(p,q)$. Since $J(p,q)$ is linear in $q$, the minimum over $q \in [0,1]^2$ is attained at a vertex, and we compute
$$J^*(p) = \frac{1}{2}\varepsilon(1 - p_1 + p_2) + \frac{1}{2}(2 - p_1 - p_2) + \begin{cases} \dfrac{3(p_1 + p_2)}{2} - 1, & 2p_1 + p_2 \le 1,\ p_1 + 2p_2 \le 1,\\[2pt] \dfrac{p_1 + 2p_2 - 1}{2}, & 2p_1 + p_2 > 1,\ p_1 + 2p_2 \le 1,\\[2pt] \dfrac{2p_1 + p_2 - 1}{2}, & 2p_1 + p_2 \le 1,\ p_1 + 2p_2 > 1,\\[2pt] 0, & 2p_1 + p_2 > 1,\ p_1 + 2p_2 > 1. \end{cases}$$
The set of equilibrium strategies for Alice is the set of maximizers of $J^*(p)$. Since $J^*(p)$ is a continuous piecewise linear function, the set of maximizers can be found by comparing its values at the extreme points of the pieces. We have
$$J^*(0,0) = \frac{1}{2}\varepsilon; \quad J^*\!\left(\tfrac{1}{2}, 0\right) = \frac{1}{4}\varepsilon + \frac{1}{2}; \quad J^*\!\left(0, \tfrac{1}{2}\right) = \frac{3}{4}\varepsilon + \frac{1}{2}; \quad J^*(1,0) = \frac{1}{2};$$
$$J^*(0,1) = \varepsilon + \frac{1}{2}; \quad J^*\!\left(\tfrac{1}{3}, \tfrac{1}{3}\right) = \frac{1}{2}\varepsilon + \frac{2}{3}; \quad J^*(1,1) = \frac{1}{2}\varepsilon.$$

Fig. 1. The pieces (polygons) on which $J^*(p)$ is linear; the extreme points $(0,0)$, $(\tfrac{1}{2},0)$, $(0,\tfrac{1}{2})$, $(1,0)$, $(0,1)$, $(\tfrac{1}{3},\tfrac{1}{3})$, $(1,1)$ of the pieces are labeled.

Since $\varepsilon < \frac{1}{3}$, $\left(\frac{1}{3}, \frac{1}{3}\right)$ attains the unique maximum among the extreme points. Hence $\arg\max_p J^*(p) = \left\{\left(\frac{1}{3}, \frac{1}{3}\right)\right\}$, i.e. Alice always plays $p^* = \left(\frac{1}{3}, \frac{1}{3}\right)$ in any BNE of the game.

Now consider Bob's equilibrium strategy. $q^*$ is an equilibrium strategy of Bob only if $p^* \in \arg\max_p J(p, q^*)$. For each $q$, $J(p,q)$ is a linear function of $p$ with
$$\nabla_p J(p,q) = \left(-\frac{1}{2}\varepsilon - \frac{1}{2} + q_1 + \frac{1}{2}q_2,\ \frac{1}{2}\varepsilon - \frac{1}{2} + \frac{1}{2}q_1 + q_2\right) \qquad \forall p \in (0,1)^2.$$
Since $p^*$ is interior, we need $\nabla_p J(p, q^*)\big|_{p = p^*} = (0,0)$. Hence
$$-\frac{1}{2}\varepsilon - \frac{1}{2} + q^*_1 + \frac{1}{2}q^*_2 = 0, \qquad \frac{1}{2}\varepsilon - \frac{1}{2} + \frac{1}{2}q^*_1 + q^*_2 = 0,$$
which implies $q^* = \left(\frac{1}{3} + \varepsilon, \frac{1}{3} - \varepsilon\right)$, proving the claim.

K. Proof of Theorem 3
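The proof below processes teams along a topological order of the condensation of the information graph into strongly connected components (SCCs). As a concrete illustration of that construction, here is a minimal sketch using Kosaraju's algorithm on a small hypothetical graph (the graph itself is an assumption for illustration):

```python
from collections import defaultdict

# Hypothetical information graph on teams 0..4; an edge u -> v means
# team u appears in the information dependency of team v.
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2), (2, 4)]
n = 5

adj, radj = defaultdict(list), defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    radj[v].append(u)

# Kosaraju's algorithm, pass 1: order vertices by DFS finish time.
seen, order = set(), []
def dfs1(u):
    seen.add(u)
    for v in adj[u]:
        if v not in seen:
            dfs1(v)
    order.append(u)
for u in range(n):
    if u not in seen:
        dfs1(u)

# Pass 2: collect SCCs on the reversed graph, in decreasing finish time.
# Component ids 0, 1, 2, ... then enumerate the condensation DAG in a
# topological order (the nodes [1], [2], ... of the proof).
comp = {}
def dfs2(u, c):
    comp[u] = c
    for v in radj[u]:
        if v not in comp:
            dfs2(v, c)
c = 0
for u in reversed(order):
    if u not in comp:
        dfs2(u, c)
        c += 1
```

On this toy graph the SCCs are $\{0,1\} \to \{2,3\} \to \{4\}$, and every cross-component edge respects the component numbering, which is exactly the ordering property the proof relies on.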
We use Theorem 2 to establish the existence of a CIB-CNE: we show that for each $t$ there always exists a pair $(\lambda^*_t, \psi^*_t)$ such that $\lambda^*_t$ forms an equilibrium at $t$ given $\psi^*_t$, and $\psi^*_t$ is consistent with $\lambda^*_t$. We provide a constructive proof of existence of a CIB-CNE by proceeding backwards in time. Since $d = 1$ we have $S^i_t = X^i_{t-1}$. The CCI consists of the beliefs along with $U_{t-1}$.

Consider the condensation of the information graph into a directed acyclic graph (DAG) whose nodes are strongly connected components. Each node may contain multiple teams. Consider one topological ordering of this DAG. Denote the nodes by $[1], [2], \cdots$ ($[j]$ is reachable from $[k]$ only if $k < j$). We use the notation $X^{[k]}_t, \Pi^{[k]}_t$ to denote the vector of the system variables of the teams in a node. In particular, following Definition 13, we define $Z^{[k]}_t = (X^{[k]}_{t-1:t}, W^{[k],Y}_t)$. We also use $[1:k]$ as shorthand for the set $[1] \cup [2] \cup \cdots \cup [k]$. Define $B^{[1:k]}_t = (\Pi^{[1:k]}_t, U^{[1:k]}_{t-1})$. (Note that the usage of the superscript here is different from the CCI $B^i_t$ defined in Definition 10.)

We construct the solution first backwards in time, then in the order of the nodes within each stage. To that end, we need an induction invariant on the value functions $V^i_t$ (as defined in Theorem 2) for the solution we are going to construct.

Induction Invariant:
For each time $t$ and each node index $k$:
• $V^i_t(b_t, x^i_{t-1})$ depends on $b_t$ only through $(b^{[1:k-1]}_t, u^i_{t-1})$ for all teams $i \in [k]$, if $[k]$ consists of only one team. (With some abuse of notation, we write $V^i_t(b_t, x^i_{t-1}) = V^i_t(b^{[1:k-1]}_t, u^i_{t-1}, x^i_{t-1})$ in this case.)
• $V^i_t(b_t, x^i_{t-1})$ depends on $b_t$ only through $b^{[1:k]}_t$ for all teams $i \in [k]$, if $[k]$ consists of multiple public teams. (We write $V^i_t(b_t, x^i_{t-1}) = V^i_t(b^{[1:k]}_t, x^i_{t-1})$ in this case.)

Induction Base:
For $t = T + 1$ we have $V^i_{T+1}(\cdot) \equiv 0$ for all coordinators $i \in \mathcal{I}$, hence the induction invariant is true.

Induction Step:

Suppose that the induction invariant is true at time $t + 1$ for all nodes. We construct the solution so that it is also true at time $t$. To complete this step we provide a procedure to solve the stage game: we argue that one can solve a series of optimization problems or finite games following the topological order of the nodes through an inner induction step.

Inner Induction Step:
Suppose that the first $k-1$ nodes have been solved, and the equilibrium strategy $\lambda^{*[1:k-1]}_t$ uses only $b^{[1:k-1]}_t$ along with private information. Suppose that the update rules $\psi^{*,[1:k-1]}_t$ have also been determined, and they use only $(b^{[1:k-1]}_t, y^{[1:k-1]}_t, u^{[1:k-1]}_t)$. We now establish the same property for $(\lambda^{[k]}_t, \psi^{[k]}_t)$.

• If the $k$-th node contains a single coordinator $i$, the value to go is $V^i_{t+1}(B^{[1:k-1]}_{t+1}, U^i_t, X^i_t)$ by the induction hypothesis. The instantaneous reward for coordinator $i$ in the $k$-th node can be expressed as $r^i_t(X^{[1:k]}_t, U^{[1:k]}_t)$ by the information graph. In the stage game, coordinator $i$ chooses a prescription to maximize the expected value of
$$Q^i_t(b^{[1:k-1]}_t, Z^{[1:k]}_t, \Gamma^{[1:k]}_t) := r^i_t(X^{[1:k]}_t, U^{[1:k]}_t) + V^i_{t+1}(B^{[1:k-1]}_{t+1}, U^i_t, X^i_t),$$
where
$$B^{[1:k-1]}_{t+1} = (\Pi^{[1:k-1]}_{t+1}, U^{[1:k-1]}_t), \qquad \Pi^j_{t+1} = \psi^{*,j}_t(b^{[1:k-1]}_t, Y^j_t, U^{[1:k-1]}_t) \ \ \forall j \in [1:k-1],$$
$$Y^j_t = \ell^j_t(X^j_t, U^{[1:k-1]}_t, W^{j,Y}_t) \ \ \forall j \in [1:k-1], \qquad U^j_t = \Gamma^j_t(X^j_t) \ \ \forall j \in [1:k].$$
The expectation is computed using the belief $\beta^i_t$ (defined through Eq. (4) in Definition 13) along with $\lambda^{*[1:k-1]}_t$, which has already been determined. It can be written as
$$\sum_{\tilde s_t, \tilde\gamma^{[1:k-1]}_t} \beta^i_t(\tilde s_t \mid x^i_{t-1})\, Q^i_t(b^{[1:k-1]}_t, \tilde s^{[1:k]}_t, (\tilde\gamma^{[1:k-1]}_t, \gamma^i_t)) \prod_{j\in[1:k-1]} \lambda^{*j}_t(\tilde\gamma^j_t \mid b^{[1:k-1]}_t, \tilde x^j_{t-1})$$
$$= \sum_{\tilde s^{[1:k]}_t, \tilde\gamma^{[1:k-1]}_t} \mathbf{1}\{\tilde x^i_{t-1} = x^i_{t-1}\}\, P(\tilde w^{[1:k],Y}_t) \prod_{j\in[1:k-1]} \pi^j_t(\tilde x^j_{t-1})\, P(\tilde x^j_t \mid \tilde x^j_{t-1}, u^{[1:k-1]}_{t-1})\, \lambda^{*j}_t(\tilde\gamma^j_t \mid b^{[1:k-1]}_t, \tilde x^j_{t-1}) \times P(\tilde x^i_t \mid x^i_{t-1}, u^{[1:k]}_{t-1})\, Q^i_t(b^{[1:k-1]}_t, \tilde s^{[1:k]}_t, (\tilde\gamma^{[1:k-1]}_t, \gamma^i_t)).$$
Therefore, the expected reward of coordinator $i$ depends on $b_t$ only through $(b^{[1:k-1]}_t, u^i_{t-1})$. Coordinator $i$ can choose the optimal prescription based on $(b^{[1:k-1]}_t, u^i_{t-1}, x^i_{t-1})$, i.e. $\lambda^{*i}_t(b_t, x^i_{t-1}) = \lambda^{*i}_t(b^{[1:k-1]}_t, u^i_{t-1}, x^i_{t-1})$. We then have $V^i_t(b_t, x^i_{t-1}) = V^i_t(b^{[1:k-1]}_t, u^i_{t-1}, x^i_{t-1})$. The update rule $\psi^{*,[k]}_t = \psi^{*,i}_t$ is then determined to be an arbitrary update rule consistent with $\lambda^{*,i}_t$, which can be chosen as a function from $\mathcal{B}^{[1:k]}_t \times \mathcal{Y}^{[k]}_t \times \mathcal{U}^{[1:k]}_t$ (instead of $\mathcal{B}_t \times \mathcal{Y}^{[k]}_t \times \mathcal{U}_t$) to $\Pi^{[k]}_{t+1}$.

• If the $k$-th node contains a group of public teams, then the update rules $\hat\psi^{*,[k]}_t$ are fixed irrespective of the stage game strategies, i.e. there exists a unique update rule $\hat\psi^{*,i}_t$ that is compatible with any $\lambda^{*,i}_t$ for a public team $i$. This update rule is a map from $\mathcal{Y}^{[k]}_t \times \mathcal{U}^{[1:k]}_t$ to a vector of delta measures on $\prod_{i\in[k]} \Delta(\mathcal{X}^i_{t-1})$, i.e. the map that recovers $X^{[k]}_{t-1}$ from the observations (see Definition 16). The function takes $U^{[1:k]}_t$ as its argument due to the fact that the observations of the $k$-th node depend on $U_t$ only through $U^{[1:k]}_t$.
The value to go for each coordinator $i$ can be expressed as $V^i_{t+1}(B^{[1:k]}_{t+1}, X^i_t)$ by the induction hypothesis. The instantaneous reward can be written as $r^i_t(X^{[1:k]}_t, U^{[1:k]}_t)$ by the definition of the information dependency graph. In the stage game, coordinator $i$ in the $k$-th node chooses a distribution $\eta^i_t$ on prescriptions to maximize the expected value of
$$Q^i_t(b^{[1:k]}_t, Z^{[1:k]}_t, \Gamma^{[1:k]}_t) := r^i_t(X^{[1:k]}_t, U^{[1:k]}_t) + V^i_{t+1}(B^{[1:k]}_{t+1}, X^i_t),$$
where
$$B^{[1:k]}_{t+1} = (\Pi^{[1:k]}_{t+1}, U^{[1:k]}_t), \qquad \Pi^j_{t+1} = \psi^{*,j}_t(b^{[1:k-1]}_t, Y^j_t, U^{[1:k-1]}_t) \ \ \forall j \in [1:k-1],$$
$$\Pi^{[k]}_{t+1} = \hat\psi^{*,[k]}_t(b^{[1:k]}_t, Y^{[k]}_t, U^{[1:k]}_t), \qquad Y^j_t = \ell^j_t(X^j_t, U^{[1:k]}_t, W^{j,Y}_t) \ \ \forall j \in [1:k], \qquad U^j_t = \Gamma^j_t(X^j_t) \ \ \forall j \in [1:k].$$
The expectation is taken with respect to the belief $\beta^i_t$ (defined through Eq. (4) in Definition 13) and the strategy prediction $\lambda^{[1:k]}_t$. This expectation can be written as
$$\sum_{\tilde s_t, \tilde\gamma^{[1:k]}_t} \beta^i_t(\tilde s_t \mid x^i_{t-1})\, Q^i_t(b^{[1:k]}_t, \tilde s^{[1:k]}_t, \tilde\gamma^{[1:k]}_t)\, \eta^i_t(\tilde\gamma^i_t) \prod_{\substack{j\in[1:k]\\ j\neq i}} \lambda^j_t(\tilde\gamma^j_t \mid b^{[1:k]}_t, \tilde x^j_{t-1})$$
$$= \sum_{\tilde s^{[1:k]}_t, \tilde\gamma^{[1:k]}_t} \mathbf{1}\{\tilde x^i_{t-1} = x^i_{t-1}\}\, P(\tilde w^{[1:k],Y}_t) \prod_{\substack{j\in[1:k]\\ j\neq i}} \pi^j_t(\tilde x^j_{t-1})\, P(\tilde x^j_t \mid \tilde x^j_{t-1}, u^{[1:k]}_{t-1})\, \lambda^{*j}_t(\tilde\gamma^j_t \mid b^{[1:k]}_t, \tilde x^j_{t-1}) \times P(\tilde x^i_t \mid x^i_{t-1}, u^{[1:k]}_{t-1})\, \eta^i_t(\tilde\gamma^i_t)\, Q^i_t(b^{[1:k]}_t, \tilde s^{[1:k]}_t, \tilde\gamma^{[1:k]}_t),$$
which depends on $b_t$ only through $b^{[1:k]}_t$. Therefore, the stage game defined in Definition 13 induces a finite game between the coordinators in the $k$-th node (instead of all coordinators) with parameter $(b^{[1:k]}_t, (\psi^{*,[1:k-1]}_t, \hat\psi^{*,[k]}_t))$ (instead of $(b_t, \psi_t)$), where $\lambda^{*[1:k-1]}_t$ has been fixed. Teams in the $k$-th node play a stage game where the first $k-1$ nodes act like nature, while the coordinators after the $k$-th node have no effect on the payoffs of the coordinators in the $k$-th node. Hence, a coordinator $i$ in the $k$-th node can base its decision on $(b^{[1:k]}_t, x^i_{t-1})$, i.e. $\lambda^{*i}_t(b_t, x^i_{t-1}) = \lambda^{*i}_t(b^{[1:k]}_t, x^i_{t-1})$. We also have $V^i_t(b_t, x^i_{t-1}) = V^i_t(b^{[1:k]}_t, x^i_{t-1})$. The update rule is determined by $\psi^{*,[k]}_t = \hat\psi^{*,[k]}_t$, which is guaranteed to be consistent with $\lambda^{*[k]}_t$.

In summary, we determine $(\lambda^*_t, \psi^*_t)$ using a node-by-node approach. If the $k$-th node consists of one team, then we first determine $\lambda^{*[k]}_t$ from an optimization problem dependent on $(\lambda^{*[1:k-1]}_t, \psi^{*,[1:k-1]}_t)$, and then determine $\psi^{*,[k]}_t$. If the $k$-th node consists of multiple public teams, then we first determine $\psi^{*,[k]}_t$ and then solve for $\lambda^{*[k]}_t$ from a finite game dependent on $(\lambda^{*[1:k-1]}_t, \psi^{*,[1:k]}_t)$. Hence we have constructed the solution and established both the inner and outer induction steps, proving the theorem.

L. Proof of Theorem 4
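The proof below is built around the signaling-free belief update: push the current belief through the transition kernel, then condition on the new observation via Bayes' rule, with no strategy term appearing anywhere. A minimal numerical sketch of one such update (the transition and observation kernels are toy assumptions):

```python
import numpy as np

# Toy kernels for one team's two-valued state (assumptions for illustration):
# P_trans[x, x'] = P(X_t = x' | X_{t-1} = x),  P_obs[x, y] = P(Y_t = y | X_t = x).
P_trans = np.array([[0.9, 0.1],
                    [0.3, 0.7]])
P_obs = np.array([[0.8, 0.2],
                  [0.4, 0.6]])

def signaling_free_update(pi, y):
    """One step of the signaling-free update: predict with the transition
    kernel, then apply Bayes' rule on Y_t = y. No strategy term appears,
    which is why the update is consistent with any CIBSF strategy."""
    pi_hat = pi @ P_trans                 # prediction: belief on X_t
    post = pi_hat * P_obs[:, y]           # unnormalized posterior
    return post / post.sum()

pi0 = np.array([0.5, 0.5])
pi1 = signaling_free_update(pi0, y=0)
```

With these toy kernels, the prediction step gives $(0.6, 0.4)$ and conditioning on $y = 0$ yields the posterior $(0.75, 0.25)$.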
We prove the theorem for $d = 1$; the proof idea for $d > 1$ is similar. We will prove a stronger result. For each $\Pi^i_t \in \Delta(\mathcal{X}^i_{t-1})$, define the corresponding $\hat\Pi^i_t \in \Delta(\mathcal{X}^i_t)$ by
$$\hat\Pi^i_t(x^i_t) := \sum_{\tilde x^i_{t-1}} \Pi^i_t(\tilde x^i_{t-1})\, P(x^i_t \mid \tilde x^i_{t-1}).$$
Define $\hat\psi^i_t$ to be the signaling-free update function, i.e. the belief update function such that
$$\Pi^i_{t+1}(x^i_t) = \hat\psi^i_t(\Pi^i_t, Y^i_t)(x^i_t) = \frac{\hat\Pi^i_t(x^i_t)\, P(Y^i_t \mid x^i_t)}{\sum_{\tilde x^i_t} \hat\Pi^i_t(\tilde x^i_t)\, P(Y^i_t \mid \tilde x^i_t)}.$$
Define open-loop prescriptions as the prescriptions that simply instruct members of a team to take a certain action irrespective of their private information. We will show that there exists an equilibrium where each team plays a common information based signaling-free (CIBSF) strategy, i.e. the common belief generation system for all coordinators is given by the signaling-free update functions $\hat\psi$, and coordinator $i$ chooses randomized open-loop prescriptions based on $\Pi_t = (\Pi^i_t)_{i\in\mathcal{I}}$ instead of $(B_t, X^i_{t-1})$.

Induction Invariant: $V^i_t(B_t, X^i_{t-1}) = V^i_t(\Pi_t, X^i_{t-1})$.

Induction Base:
The induction invariant is true for $t = T + 1$ since $V^i_{T+1}(\cdot) \equiv 0$ for all $i \in \mathcal{I}$.

Induction Step:

Suppose that the induction invariant is true for $t + 1$; we prove it for time $t$. Let $\hat\psi_t$ be the signaling-free update rule. We solve the stage game $G_t(V_{t+1}, \hat\psi_t, b_t)$. In the stage game, coordinator $i$ chooses a prescription to maximize the expectation of $r^i_t(X^{-i}_t, U_t) + V^i_{t+1}(\Pi_{t+1}, X^i_t)$, where
$$\hat\Pi^k_{t+1}(x^k_{t+1}) = \sum_{\tilde x^k_t} \Pi^k_{t+1}(\tilde x^k_t)\, P(x^k_{t+1} \mid \tilde x^k_t) \ \ \forall x^k_{t+1} \in \mathcal{X}^k_{t+1}, \qquad \Pi^k_{t+1} = \hat\psi^k_t(\Pi^k_t, Y^k_t) \ \ \forall k \in \mathcal{I},$$
$$Y^k_t = \ell^k_t(X^k_t, W^{k,Y}_t) \ \ \forall k \in \mathcal{I}, \qquad U^{k,j}_t = \Gamma^{k,j}_t(X^{k,j}_t) \ \ \forall (k,j) \in \mathcal{N}.$$
Since $V^i_{t+1}(\Pi_{t+1}, X^i_t)$ does not depend on coordinator $i$'s prescriptions, coordinator $i$ only needs to maximize the expectation of $r^i_t(X^{-i}_t, U_t)$, which is
$$\sum_{\tilde x^{-i}_{t-1:t}, \tilde\gamma^{-i}_t} \prod_{j\neq i} \pi^j_t(\tilde x^j_{t-1})\, P(\tilde x^j_t \mid \tilde x^j_{t-1})\, \lambda^j_t(\tilde\gamma^j_t \mid b_t, \tilde x^j_{t-1})\, r^i_t(\tilde x^{-i}_t, (\tilde\gamma^{-i}_t(\tilde x^{-i}_t), \gamma^i_t(x^i_t))).$$

Claim: In the stage game, if all coordinators $-i$ use CIBSF strategies, then coordinator $i$ can respond with a CIBSF strategy.

Proof of Claim:
Let $\eta^k_t: \Pi_t \to \Delta(\mathcal{U}^k_t)$ be the CIBSF strategy of coordinator $k \neq i$. Then coordinator $i$'s expected payoff given $\gamma^i_t$ can be written as
$$\begin{aligned}
&\sum_{\tilde x^{-i}_{t-1:t}, \tilde u^{-i}_t} \prod_{j\neq i} \pi^j_t(\tilde x^j_{t-1})\, P(\tilde x^j_t \mid \tilde x^j_{t-1})\, \eta^j_t(\tilde u^j_t \mid \pi_t)\, r^i_t(\tilde x^{-i}_t, (\tilde u^{-i}_t, \gamma^i_t(x^i_t)))\\
&= \sum_{\tilde x^{-i}_t, \tilde u^{-i}_t} \prod_{j\neq i} \left(\sum_{\tilde x^j_{t-1}} \pi^j_t(\tilde x^j_{t-1})\, P(\tilde x^j_t \mid \tilde x^j_{t-1})\right) \eta^j_t(\tilde u^j_t \mid \pi_t)\, r^i_t(\tilde x^{-i}_t, (\tilde u^{-i}_t, \gamma^i_t(x^i_t)))\\
&= \sum_{\tilde x^{-i}_t, \tilde u^{-i}_t} \prod_{j\neq i} \hat\pi^j_t(\tilde x^j_t)\, \eta^j_t(\tilde u^j_t \mid \pi_t)\, r^i_t(\tilde x^{-i}_t, (\tilde u^{-i}_t, \gamma^i_t(x^i_t))) =: \bar r^i_t(\pi_t, \eta^{-i}_t, \gamma^i_t(x^i_t)).
\end{aligned}$$
Hence coordinator $i$ can respond with a prescription $\gamma^i_t$ such that $\gamma^i_t(x^i_t) = u^i_t$ for all $x^i_t$, where $u^i_t \in \arg\max_{\tilde u^i_t} \bar r^i_t(\pi_t, \eta^{-i}_t, \tilde u^i_t)$ can be chosen based on $(\pi_t, \eta^{-i}_t)$, proving the claim.

Given the claim, we conclude that there exists a stage game equilibrium where all coordinators play CIBSF strategies: define a new stage game where we restrict each coordinator to CIBSF strategies. By the claim, a best response in the restricted stage game is also a best response in the original stage game. The restricted game is a finite game (it is a game of symmetric information with parameter $\pi_t$ where coordinator $i$'s action is $u^i_t$ and its payoff is a function of $\pi_t$ and $u_t$) and hence always has an equilibrium. The equilibrium strategy is consistent with $\hat\psi_t$ due to Lemma 11.

Lemma 11. The signaling-free update rule $\hat\psi^i_t$ is consistent with any $\lambda^i_t: \mathcal{B}_t \times \mathcal{X}^i_{t-1} \to \Delta(\Gamma^i_t)$ that corresponds to a CIBSF strategy at time $t$.

Proof. Can be done with standard arguments for strategy independence of beliefs.

Let $\eta^*_t = (\eta^{*j}_t)_{j\in\mathcal{I}}$, $\eta^{*j}_t: \Pi_t \to \Delta(\mathcal{U}^j_t)$, be a CIBSF strategy profile that is a stage game equilibrium. Then the value function
$$V^i_t(b_t, x^i_{t-1}) = \left(\max_{\tilde u^i_t} \bar r^i_t(\pi_t, \eta^{*-i}_t, \tilde u^i_t)\right) + \sum_{\tilde x_t, \tilde y_t} V^i_{t+1}(\hat\psi_t(\pi_t, \tilde y_t), \tilde x^i_t)\, P(\tilde y_t \mid \tilde x_t)\, P(\tilde x^i_t \mid x^i_{t-1})\, \hat\pi^{-i}_t(\tilde x^{-i}_t)$$
depends on $(b_t, x^i_{t-1})$ only through $(\pi_t, x^i_{t-1})$, establishing the induction step.

M. Proof of Lemma 9
For ease of illustration we prove the result for $d = 2$. The result for $d = 1$ is trivially true, and the result for $d > 2$ can be proved following a logic similar to that of this proof.

The key idea is to apply a person-by-person refinement of a team strategy. Let $g^{-i}$ be some behavioral coordination strategy profile for coordinators other than coordinator $i$. Let $\mu^i$ denote a pure team strategy that is a best response to $g^{-i}$. Note that we are not considering a coordination strategy, and no randomization is considered. At time $t$, agent $(i,j)$ decides on her action through $u^{i,j}_t = \mu^{i,j}_t(h^{i,j}_t)$. To proceed we first prove the following lemma.

Lemma 12. Fixing $g^{-i}$, for any pure team strategy profile $\mu^i$ and any $(i,j) \in \mathcal{N}^i$, there exists a pure team strategy profile $\tilde\mu^i$ such that (1) $\tilde\mu^{i,j}_t(h^{i,j}_t)$ does not depend on $x^{i,j}_{t-1}$; (2) $\tilde\mu^{i,-j} = \mu^{i,-j}$; (3) $J^i(\tilde\mu^i, g^{-i}) \ge J^i(\mu^i, g^{-i})$.

Given the result of Lemma 12, we can refine any best-response pure strategy $\mu^i$ in a person-by-person manner to obtain a pure strategy in which $\mu^{i,j}_t(h^{i,j}_t)$ does not depend on $x^{i,j}_{t-1}$ for all $(i,j) \in \mathcal{N}^i$. Then, one can transform the new pure strategy $\mu^i$ into one of its equivalent pure coordination strategies $\nu^i$, where $\nu^i$ always assigns simple prescriptions.

Proof of Lemma 12.
Fix the strategy $\mu^{i,-j}$ for members of team $i$ other than $(i,j)$, and also fix $g^{-i}$ for the other teams. We let agent $(i,j)$ refine her strategy to maximize team $i$'s expected reward, given that the strategies of the others are fixed. We argue that agent $(i,j)$ is facing a POMDP problem with:
• State: $(Y_{1:t-2}, U_{1:t-1}, X^{-i}_t, \Gamma^{-i}_{t-1}, X^{i,-j}_t, X^{i,j}_t)$
• Observation: $H^{i,j}_t = (Y_{1:t-2}, U_{1:t-1}, X^i_{1:t-2}, X^{i,j}_{t-1:t})$
• Action: $U^{i,j}_t$
• Instantaneous reward: $r^i_t(X_t, U_t)$
where $U^{-i}_t$ follows the distribution induced by the random prescriptions generated by $g^{-i}_t$ and $(H_t, X^{-i}_t, \Gamma^{-i}_{t-1})$, and $U^{i,-j}_t$ is generated from $\mu^{i,-j}_t$.

By the standard POMDP structural result, the conditional distribution of the state given the observation is an information state for agent $(i,j)$. Notice that $X^{i,j}_{t-1}$ only appears in the observation but not in the state. Furthermore, $Y_{1:t-2}$, $U_{1:t-1}$, and $X^{i,j}_t$ are perfectly observed by agent $(i,j)$. Therefore, to prove that agent $(i,j)$ does not need to use $X^{i,j}_{t-1}$, it is sufficient to prove the following claim:

Claim: $P^{\mu^{i,-j}, g^{-i}}(x^{-i}_t, \gamma^{-i}_{t-1}, x^{i,-j}_{t-1:t} \mid h^{i,j}_t)$ does not depend on $x^{i,j}_{t-1}$.

Proof of Claim. Due to the conditional independence among different teams (Lemma 3), we have
$$P^{\mu^{i,-j}, g^{-i}}(x^{-i}_t, \gamma^{-i}_{t-1}, x^{i,-j}_{t-1:t} \mid h^i_t, x^{i,j}_{t-1:t}) = P^{\mu^{i,-j}}(x^{i,-j}_{t-1:t} \mid h^i_t, x^{i,j}_{t-1:t}) \prod_{k\neq i} P^{g^{-i}}(x^k_t, \gamma^k_{t-1} \mid h_t). \qquad (25)$$
Note that the first conditional belief term on the right hand side of (25) does not depend on the strategy $g^{-i}$ (by Lemma 3) or on $\mu^{i,j}_{1:t-1}$ (by the standard policy-independence property of beliefs in POMDPs). As such, one can evaluate this conditional belief term assuming that the other teams play according to an open-loop strategy $\hat g^{-i}$, which generates $u^{-i}_{1:t-1}$, and that agent $(i,j)$ plays according to an open-loop strategy $\hat g^{i,j}$, which generates $u^{i,j}_{1:t-1}$. Consequently, letting $\hat x^{-i}_{t-2} \in \mathcal{X}^{-i}_{t-2}$ be arbitrary, by Bayes' rule we obtain
$$P^{\mu^{i,-j}}(x^{i,-j}_{t-1:t} \mid h^i_t, x^{i,j}_{t-1:t}, \hat x^{-i}_{t-2}) = \frac{P^{\hat g}(x^{i,-j}_{t-1:t}, y_{t-2}, u_{t-1}, x^{i,j}_{t-1:t} \mid h_{t-1}, x^i_{t-2}, \hat x^{-i}_{t-2})}{\sum_{\tilde x^{i,-j}_{t-1:t}} P^{\hat g}(\tilde x^{i,-j}_{t-1:t}, y_{t-2}, u_{t-1}, x^{i,j}_{t-1:t} \mid h_{t-1}, x^i_{t-2}, \hat x^{-i}_{t-2})} \qquad (26)$$
where $\hat g = (\hat g^{i,j}, \mu^{i,-j}, \hat g^{-i})$. We have
$$\begin{aligned}
&P^{\mu^{i,-j}, \hat g^{i,j}, \hat g^{-i}}(x^{i,-j}_{t-1:t}, y_{t-2}, u_{t-1}, x^{i,j}_{t-1:t} \mid h_{t-1}, x^i_{t-2}, \hat x^{-i}_{t-2})\\
&= P(x^{i,-j}_t \mid x^{i,-j}_{t-1}, u_{t-1})\, P(x^{i,j}_t \mid x^{i,j}_{t-1}, u_{t-1})\, P(y^i_{t-2} \mid x^i_{t-2}, u_{t-2})\, P(y^{-i}_{t-2} \mid \hat x^{-i}_{t-2}, u_{t-2})\\
&\quad\times \mathbf{1}\{\mu^{i,-j}_{t-1}(y_{1:t-2}, u_{1:t-2}, x^i_{1:t-2}, x^{i,-j}_{t-1}) = u^{i,-j}_{t-1}\}\, P(x^{i,-j}_{t-1} \mid x^{i,-j}_{t-2}, u_{t-2})\, P(x^{i,j}_{t-1} \mid x^{i,j}_{t-2}, u_{t-2})\\
&= F_t(x^{i,-j}_{t-1:t}, h^i_t) \cdot G_t(x^{i,j}_{t-1:t}, \hat x^{-i}_{t-2}, h^i_t)
\end{aligned}$$
for some functions $F_t$ and $G_t$. Through Eq. (26) we then have
$$P^{\mu^{i,-j}}(x^{i,-j}_{t-1:t} \mid h_t, x^i_{t-2}, x^{i,j}_{t-1:t}, \hat x^{-i}_{t-2}) = \frac{F_t(x^{i,-j}_{t-1:t}, h^i_t) \cdot G_t(x^{i,j}_{t-1:t}, \hat x^{-i}_{t-2}, h^i_t)}{\sum_{\tilde x^{i,-j}_{t-1:t}} F_t(\tilde x^{i,-j}_{t-1:t}, h^i_t) \cdot G_t(x^{i,j}_{t-1:t}, \hat x^{-i}_{t-2}, h^i_t)} = \frac{F_t(x^{i,-j}_{t-1:t}, h^i_t)}{\sum_{\tilde x^{i,-j}_{t-1:t}} F_t(\tilde x^{i,-j}_{t-1:t}, h^i_t)},$$
which does not depend on $x^{i,j}_{t-1:t}$ or $\hat x^{-i}_{t-2}$ (in particular, it does not depend on $x^{i,j}_{t-1}$). Hence we have proved the claim.

Therefore, agent $(i,j)$ can solve the POMDP problem and obtain an optimal strategy $\tilde\mu^{i,j}$ that does not use $X^{i,j}_{t-1}$ to choose actions. Define $\tilde\mu^i = (\tilde\mu^{i,j}, \mu^{i,-j})$. One can verify that conditions (1)–(3) of Lemma 12 are satisfied.

N. Proof of Proposition 3