On the Equilibrium Elicitation of Markov Games Through Information Design
arXiv [cs.MA]

A PREPRINT
Tao Zhang∗
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY 11201
[email protected]
Quanyan Zhu
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY 11201
[email protected]
February 16, 2021

ABSTRACT
This work considers a novel information design problem and studies how crafting payoff-relevant environmental signals alone can influence the behaviors of intelligent agents. The agents' strategic interactions are captured by an incomplete-information Markov game, in which each agent first selects one environmental signal from multiple signal sources as additional payoff-relevant information and then takes an action. There is a rational information designer (designer) who possesses one signal source and aims to control the equilibrium behaviors of the agents by designing the information structure of the signals she sends to the agents. An obedient principle is established which states that it is without loss of generality to focus on direct information design, in which the design incentivizes each agent to select the signal sent by the designer, so that the design process avoids predicting the agents' strategic selection behaviors. We then introduce the design protocol for a given goal of the designer, referred to as obedient implementability (OIL), and characterize OIL in a class of obedient perfect Bayesian Markov Nash equilibria (O-PBME). A new framework for information design is proposed based on an approach of maximizing the optimal slack variables. Finally, we formulate the designer's goal selection problem and characterize it in terms of information design by establishing a relationship between the O-PBME and the Bayesian Markov correlated equilibria, building upon the revelation principle in classic information design in economics. The proposed approach can be applied to elicit desired behaviors of multi-agent systems in competing as well as cooperating settings, and can be extended to heterogeneous stochastic games in complete- and incomplete-information environments.

Keywords Information design · Markov game · Manipulation · Multiagent system · Artificial intelligence

∗ Corresponding author.
Building rational multi-agent systems is an important research desideratum in artificial intelligence. In goal-directed decision-making systems, an agent's action is controlled by its consequence [1]. In a game, the consequence of an agent's action is the outcome of the game, given as the reward of taking that action as well as the actions of his opponents, which situates the optimality criterion of each agent's decision making in the game. A rational agent's reward may also depend on payoff-relevant information, in addition to the actions. The information may include the situation of the agents in a game, referred to as the state of the world, as well as his knowledge about his opponents' diverging interests and their preferences over the outcomes of the game. Incorporating such payoff-relevant information in his decisions constitutes an essential part of an agent's rationality in the strategic interactions with his opponents. Hence, one may redirect the goal achievement of rational agents in a game by information provision. In economics, this is referred to as information design, which studies how an information designer (she) can influence agents' optimal behaviors in a game to achieve her own objective, through the design of information provided to the game [2]. Referred to as inverse game theory, mechanism design is a well-developed mathematical theory in economics that provides general principles of how to design rules of games (e.g., rewarding systems with specifications of actions and outcomes) to influence the agents' strategic interactions and achieve system-wide goals while treating the information as given. Information design, on the other hand, considers circumstances in which the information in the environment is under the control of the system designer, and offers a new approach to indirectly elicit agents' behaviors while keeping the game rules fixed [3].

This work considers a finite-agent infinite-horizon Markov game in an incomplete-information environment.
Each agent privately possesses payoff-relevant information, called his type, with a commonly known prior probability distribution. At each period of time, agents observe a payoff-relevant global state (state). In addition to the type and the state, each agent observes a batch of signals (signal batch, batch) at each period and then strategically chooses one signal from the batch as additional information for his decision of actions. Each agent's one-period reward is determined by his own action, the actions of his opponents, the global state, and his choice of signal. We refer to this game as a base Markov game (BMG). The transition of the state, the prior of types, and the distribution of signals are referred to as the information structure of the BMG. In a BMG, each agent's behavior includes selecting a signal according to a selection rule and taking an action according to a policy. Here, each agent's selection of signal and choice of action are coupled, since the selected signal enters the policy to determine the choice of the action. If a mechanism designer aims to incentivize the agents to behave in her desired way, she directly modifies the BMG (reversing the game) by changing the rules of encounters, including changing the reward function associated with actions and outcomes, while treating the information structure as part of the environment. An information designer, however, treats the BMG as fixed and modifies the information structure to elicit agents' equilibrium behaviors that coincide with her objective. We study a novel dynamic information design problem in the BMG in which there are multiple sources of signals (signal sources, sources), each of which sends one signal to each agent. The signals sent by all sources constitute the signal batch observed by each agent at each time.
Among these sources, there is one rational information designer (referred to as the principal, she) who controls one signal source and intends to strategically craft the information structure of her signal by choosing a signaling rule, thereby indirectly controlling the equilibrium of the BMG. We consider that the other sources of signals provide additional information to the agents in a non-strategic, take-it-or-leave-it manner. The goal
of the principal is to induce the agents to take actions according to an equilibrium policy that she desires. However, the principal has no ability to directly program the agents' behaviors to force them to take certain actions. Instead, her information design should provide incentives for rational agents to behave in her favor. We study the extent to which the provision of signals alone, by controlling a single signal source, can influence the agents' behavior in a BMG when the agents have the freedom to choose any available signal in the batch. We name the BMG with a rational principal in this setting an expanded Markov game (EMG).

Since the principal's design problem keeps the base game unchanged, our model fits scenarios in which the agents are intrinsically motivated and their internal reward systems translate information from the external environment into internal reward signals [4]. Intrinsically motivated rational agents can be human decision makers with intrinsic psychological preferences or intelligent agents programmed with an internal reward system. The setting of multiple sources of additional information captures circumstances in which the environment is perturbed by noisy information, so that the agents may improperly use redundant and useless information to make decisions that deviate from the system designer's desire. Also, the principal can be an adversary who aims to manipulate the strategic interactions in a multi-agent system through the provision of misinformation, without intruding into each agent's local system to make any physical or digital modifications.

Although the principal's objective of information design in an EMG is to elicit an equilibrium policy, her design problem has to take into consideration how the agents select the signals from their signal batches, because each agent's choice of action is coupled with his selection of signal.
In an information design problem, the principal chooses an information structure such that each agent selects a signal using a selection rule and then takes an action according to a policy that matches the principal's goal. We use admissibility to denote the constraint that the agents' equilibrium policy coincides with the principal's goal. We classify the information design problems in an EMG into two classes: indirect information design (IID) and direct information design (DID). An IID is indirect in the sense that the signals sent by the principal may not be selected by some agents, so that the actions taken by those agents are independent of the principal's (realized) signals. However, even though her signal does not enter an agent's policy to take an action, the principal can still influence the agent's action, because the distribution of the signal batch is influenced by her information structure (given the distributions of signals from other sources), which affects the agents' selections of signals. Hence, the agents' behaviors indirectly depend on the principal's choice of information structure. IID requires the principal to accurately predict each agent's strategic selection rule as well as the induced policies. In DID problems, on the other hand, each agent always selects the signal sent by the principal and then takes an action. Thus, the realizations of the principal's signals directly enter the agents' choices of actions. In addition to admissibility, another key restriction of the principal's DID problem is a notion of obedience, which requires that, under the principal's information structure, each agent is incentivized to select the signal from the principal rather than choose one from other signal sources.
The key simplification provided by the DID is that the principal's prediction of the agents' strategic selection rules is replaced by a straightforward obedient selection rule that always prefers the principal's signals.

This paper makes three major contributions to the foundations of information design. First, we define a dynamic information design problem in an environment where the agents have the freedom to choose any available signal as additional information. An obedient principle is established which formally states that for every IID that leads to an equilibrium policy, there exists a DID that leads to the same equilibrium policy. As a result, the principal can focus on the DID of the EMG. Captured by the notion of obedient implementability, the principal's DID problem is constrained by the obedience condition, which incentivizes the agents to select the principal's signals, and the admissibility condition, under which the agents take actions that meet the principal's goal equilibrium. Our information design problem is distinguished from others in economics that study the commitment of information design in a game when there is only a single source of additional information, in static settings (e.g., [5, 3, 6, 7]) as well as in dynamic environments (e.g., [8, 9, 10, 11, 12, 13]), and from settings in which the agents do not make a choice among multiple designers (e.g., [14]). Second, we propose a new solution concept termed obedient perfect Bayesian Markov Nash equilibrium (O-PBME), which allows us to handle the undesirable equilibrium deviations of agents in DID in a principled manner. Specifically, by bridging our incomplete-information Markov game with dynamic programming and uncovering the close relationship between the O-PBME and the optimization of occupancy measures, we characterize obedient implementability and explicitly construct the principal's DID problem. Third, we formulate the principal's optimal goal selection problem and transform it into an optimal DID problem in which the admissibility condition is replaced by the optimality of the induced equilibrium policy with respect to the principal's objective. A representation principle is obtained which formally states that the principal's goal selection from a set of equilibria, referred to as the Bayesian Markov correlated equilibria, can be fully characterized by an information design that is implementable in an O-PBME.
We follow a growing line of research on creating incentives for interacting agents to behave in a desired way. The most straightforward way is based on mechanism design approaches that properly provide reward incentives (e.g., contingent payments, penalties, supply of resources) by directly modifying the game itself to change the induced preferences of the agents over actions. Mechanism design approaches have been fruitfully studied in both static [15] and dynamic environments [16, 17, 18]. For example, auctions [19, 20] specify the way in which the agents can place their bids and clarify how the agents pay for the items; in matching markets [21, 22], matching rules match agents on one side of a market to agents on the other side, which directly affects the payoff of each matched individual. In the reinforcement learning literature, reward engineering [23, 24, 25] is similar to mechanism design in that it directly crafts the reward functions of the agents to encode specifications of the learning goal.

Our work lies in another direction: information design. Information design studies how to influence the outcomes of decision making by choosing a signal (also referred to as a signal structure, information structure, Blackwell experiment, or data-generating process) whose realizations are observed by the agents [26]. In a seminal paper [7], Kamenica and Gentzkow introduced Bayesian persuasion, in which there is an informed sender and an uninformed receiver. The sender is endowed with the ability to commit to any probability distribution (i.e., the information structure) of the signals as a function of the state of the world, which is payoff-relevant to and unobserved by the receiver. Bayesian persuasion can be interpreted as a communication device that is used by the sender to inform the receiver through signals that contain knowledge about the state of the world. Hence, the sender controls what the agent gets to know about the payoff-relevant state. With knowledge of the information structure, the receiver forms a posterior belief about the unobserved state based on the received signal. Hence, the information design of Bayesian persuasion is also referred to as an exercise in belief manipulation. Other works alongside Bayesian persuasion include [27, 28, 29, 30]. In [5], Mathevet et al. extend the single-agent Bayesian persuasion of [7] to a multi-agent
game and formulate the information design of influencing agents' behaviors through inducing distributions over agents' beliefs. In [6], Bergemann and Morris have also considered information design in games. They have formulated the Myersonian approach for information design in an incomplete-information environment. The essence of the Myersonian information design is the notion of Bayes correlated equilibrium, which characterizes all possible Bayesian Nash equilibrium outcomes that could be induced by all available information structures. The Myersonian approach avoids the modeling of belief hierarchies [31] and constructs the information design problem as a linear program. Information design has been applied in a variety of areas to study and improve real-world decision-making protocols, including stress tests in finance [32, 33], law enforcement and security [34, 35], censorship [36], routing systems [37], and finance and insurance [38, 39, 40]. Kamenica [26] has provided a recent survey of the literature on Bayesian persuasion and information design.

This work is based on Myersonian approaches and fundamentally differs from existing works on information design. First, we consider a different environment. Specifically, we consider a setting in which there are multiple sources of signals and each agent chooses one realized signal as additional (payoff-relevant) information at each time. Among these sources of signals, there is an information designer who controls one of these sources and aims to induce equilibrium outcomes of the incomplete-information Markov game by strategically crafting information structures. Second, rather than only taking actions, each agent in our model makes a coupled decision of selecting a realized signal and taking an action. Hence, the characterization of the solution concepts in our work is different from the equilibrium analysis in other works.
Third, we also provide an approach with an explicit formulation for relaxing the optimal information design problem.
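To make the linear-programming view behind the Myersonian approach concrete, here is a minimal static sketch of our own (a standard correlated equilibrium in the game of Chicken, not this paper's dynamic formulation): a recommendation distribution over joint actions is an equilibrium exactly when it satisfies a family of linear obedience inequalities, over which a designer can then optimize. The game, payoff numbers, and helper names below are all illustrative.

```python
# Toy static analogue of the obedience constraints behind Bayes correlated
# equilibrium: check whether a joint recommendation distribution mu over
# action profiles is obedient, i.e., no agent gains by unilaterally
# deviating from his recommended action. Payoffs: the game of Chicken.

ACTIONS = ("dare", "chicken")

# U[(a1, a2)] = (payoff of agent 1, payoff of agent 2)
U = {
    ("dare", "dare"): (0, 0),
    ("dare", "chicken"): (7, 2),
    ("chicken", "dare"): (2, 7),
    ("chicken", "chicken"): (6, 6),
}

def is_obedient(mu, eps=1e-9):
    """True iff, conditional on every recommendation with positive
    probability, no deviation raises the agent's expected payoff."""
    for i in (0, 1):                       # agent index
        for rec in ACTIONS:                # recommended action for agent i
            follow_val, mass = 0.0, 0.0
            deviate = {d: 0.0 for d in ACTIONS}
            for profile, p in mu.items():
                if profile[i] != rec:
                    continue
                mass += p
                follow_val += p * U[profile][i]
                for d in ACTIONS:          # payoff had i played d instead
                    alt = list(profile)
                    alt[i] = d
                    deviate[d] += p * U[tuple(alt)][i]
            if mass > eps and any(deviate[d] > follow_val + eps for d in ACTIONS):
                return False
    return True

# Classic correlated equilibrium of Chicken: equal weight on the three
# profiles that avoid (dare, dare).
ce = {("dare", "chicken"): 1/3, ("chicken", "dare"): 1/3,
      ("chicken", "chicken"): 1/3}
```

Because each obedience condition is linear in mu, maximizing a designer's expected objective subject to these inequalities is a linear program, which is the computational core of the Myersonian approach.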
Organization.
The rest of the paper is organized as follows. Section 2 describes the background and the basic concepts related to this work. In Section 3, we describe the model and formally define the information design problem. The notions of implementability are introduced to describe the optimality of the information design in the indirect and the direct settings. The obedient perfect Bayesian Markov Nash equilibrium (O-PBME) is defined as the solution concept of our information design. Section 4 characterizes obedient implementability in O-PBME by formulating an explicit design regime for the principal's information structure. Section 6 concludes the work.
Convention.
For compactness of notation, we show only the elements, but not the sets, over which sums are taken under the summation operator. The notation is summarized in Appendix ??.

In this section, we review fundamental concepts in game theory to situate the contributions of this work. This work focuses on games of n self-interested agents, n < ∞, denoted by N ≡ [n], whose action space is given as A ≡ {A_i}_{i∈N}. A typical agent is referred to as agent i, i ∈ N. The normal form (or strategic form) is a basic representation of a static game:
Definition 0.1 (Normal-Form Game [41]). A normal-form game is defined by a tuple G ≡ ⟨N, A, u⟩, where u ≡ {u_i}_{i∈N} and u_i : A → R is the payoff function of agent i ∈ N.

A normal-form game considers that agents' payoffs for the outcomes of the game are common knowledge in equilibrium. Each agent i ∈ N simultaneously chooses an action a_i ∈ A_i and receives a payoff u_i(a_i, a_{-i}) when the other
agents choose actions a_{-i}. Bayesian games extend normal-form games by capturing settings in which agents hold private information. The private information characterizes, e.g., the agent's preference or taste over the outcomes of the game, and determines the payoffs the agent may obtain for every outcome of the game. Unlike in normal-form games, each agent i in a Bayesian game does not know the types of all other agents, θ_{-i} ∈ Θ_{-i}. A common approach to modeling this incomplete-information setting is to adopt Harsanyi's idea of introducing a move by Nature [42], which handles the agents' uncertainty about others by transforming the incomplete-information game into an imperfect-information game. In Harsanyi's model, each agent's private information is known as his type and is randomly chosen by Nature according to some prior distribution, which is commonly known by all the agents and is referred to as the common prior.

Definition 0.2 (Bayesian Game [43]). A Bayesian game G^B is defined by a tuple G^B ≡ ⟨N, A, Θ, d_θ, u⟩, where Θ ≡ ×_{i∈N} Θ_i and Θ_i is the type space of agent i ∈ N; d_θ ≡ {d_{θ,i}}_{i∈N}, where d_{θ,i} is the prior distribution of agent i's type θ_i ∈ Θ_i; and u ≡ {u_i}_{i∈N}, where u_i : Θ_i × A → R is the payoff function of agent i ∈ N.

Based on his type, each agent simultaneously takes an action. Each agent i ∈ N receives a payoff u_i(θ_i, a_i, a_{-i}) when his type is θ_i, he takes action a_i, and the others take a_{-i}. We refer to O ≡ ⟨Θ, d_θ⟩ as the global information structure of the game. Let O_{-i} ≡ ⟨Θ_{-i}, d_{θ,-i}⟩ denote the information structure of agents other than i. Given a global information structure O, we write the observation as o_i ≡ ⟨θ_i | O⟩, which contains the information observed by agent i.
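A toy instance of a Bayesian game in the sense of Definition 0.2 can be sketched as follows: each type of each agent computes an interim expected payoff by averaging over the opponent's type under the common prior, and a strategy profile is an equilibrium when every type's prescribed action is a best response. The entry-game payoffs, prior, and helper names below are invented for illustration.

```python
# Toy Bayesian game: each agent has type 0 or 1 under a common prior, and
# payoff u_i(theta_i, a_i, a_j) = a_i * (theta_i - 0.5 * a_j): acting
# (a=1) pays your type, minus a congestion cost 0.5 if the opponent acts.

PRIOR = {0: 0.4, 1: 0.6}          # common prior over each agent's type

def payoff(theta_i, a_i, a_j):
    return a_i * (theta_i - 0.5 * a_j)

def interim_payoff(theta_i, a_i, opp_strategy):
    """Expected payoff of action a_i for type theta_i, averaging over the
    opponent's type (hence over his action opp_strategy[theta_j])."""
    return sum(p * payoff(theta_i, a_i, opp_strategy[theta_j])
               for theta_j, p in PRIOR.items())

def best_response(theta_i, opp_strategy):
    return max((0, 1), key=lambda a: interim_payoff(theta_i, a, opp_strategy))

# Candidate symmetric strategy: act iff your type is 1.
sigma = {0: 0, 1: 1}
```

Since best_response reproduces sigma for both types, sigma against itself forms a (pure-strategy) Bayesian Nash equilibrium of this toy game.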
Here, O is common knowledge and o_i is private information of agent i, only partially known (imperfect information) to other agents through O.

Markov games generalize normal-form games to dynamic settings, and Markov decision processes to multi-agent interactions. An N-agent infinite-horizon Markov game is a complete-information game in which O = ⟨S, d_s, T_s⟩, where S is a finite set of states, d_s is the initial distribution of the state, and T_s : S × A → ∆(S) is the transition function of the states; all agents observe the same information, i.e., o_i(s_t | O) = o_j(s_t | O) for all i ≠ j. Each realization of the state s_t is payoff-relevant to all agents and is commonly observed. The joint actions of the agents partially control the dynamics of the states: the probability distribution of the next state is given by T_s(· | s_t, a_t) when the current state is s_t and the agents take joint actions a_t ≡ {a_{i,t}}_{i∈N} ∈ A.

Definition 0.3 (Markov Game). A Markov game M is defined by a tuple M ≡ ⟨N, A, O, {R_i}_{i∈N}⟩, where R_i : S × A → R is a reward function of agent i ∈ N that maps the state and the joint actions to a reward.

A solution to M is a policy profile π : S → ∆(A), which specifies the joint actions of the agents given the state. In a Markov game, π can be either independent (i.e., π(a_t | s_t) = ∏_{i∈N} π_i(a_{i,t} | s_t)) or correlated (i.e., a joint function). In this work, we extend Markov games to an incomplete-information setting based on Harsanyi's model. The global information structure of the game, which is commonly known, is O = ⟨S, Θ, d_s, d_θ, T_s, T_θ⟩, where T_θ ≡ {T_{θ,i}}_{i∈N}. At each period t ≥ 0, agent i observes o_i = ⟨s_t, θ_{i,t} | O⟩, where s_t is commonly observed and θ_{i,t} is the private information of agent i at period t. A special case of imperfect-information Markov games arises when the agents' private types are static, i.e., O^B = ⟨S, Θ, d_s, d_θ, T_s⟩.
We refer to such a Markov game as a canonical Bayesian Markov game:

Definition 0.4 (Canonical Bayesian Markov Game). A canonical Bayesian Markov game M^B is defined by a tuple M^B ≡ ⟨N, A, O^B, {R_i}_{i∈N}⟩, where R_i : S × A × Θ_i → R is a reward function of agent i ∈ N that maps the state, the joint actions, and his type to a reward.
A solution to M^B is a belief-policy profile ⟨μ, π^B⟩. Here, μ ≡ {μ_i}_{i∈N} is the belief system of the agents, and μ_i ∈ ∆(Θ_{-i}) is agent i's belief about the other agents' private information θ_{-i}. The policy profile π^B : S × Θ → ∆(A) is such that π^B(a_t | s_t, θ) specifies the probability distribution of the joint actions a_t given the state s_t and the joint types θ. As with the policy profile in M, π^B can be either an independent function (i.e., π^B(a_t | s_t, θ) = ∏_{i∈N} π^B_i(a_{i,t} | s_t, θ_i)) or a correlated function. We write Π for the set of policy profiles and Π_i for the set of policies of agent i, i ∈ N.

The next step is to define the optimality criteria of M^B. The agents' decision making (i.e., determining π^B) is guided by each agent's expected payoff (discounted by 0 < γ ≤ 1). Specifically, given the global information structure O^B and the joint policy, we define agent i's discounted expected payoff as

J_i(π^B; μ_i | O^B) ≡ E^{μ_i}_{π^B}[ ∑_{t=0}^∞ γ^t R_i(s_t, a_t | θ_i) | O^B ],  (1)

where E^{μ_i}_{π^B}[· | O^B] denotes the expectation with respect to the unique probability law induced by O^B and ⟨π^B, μ_i⟩. We define the notion of Bayesian Markov Nash equilibrium as a solution concept of M^B that extends the concept of Bayesian Nash equilibrium [44] to our imperfect-information Markov setting:
Definition 0.5 (BME). A profile ⟨π^{BME}, μ⟩ constitutes a Bayesian Markov Nash equilibrium (BME) if the following hold for all i ∈ N:

(i) Optimality:

J_i(π^{BME}; μ_i | O^B) = sup_{π^B_i ∈ Π_i} J_i(π^{BME}_{-i}, π^B_i; μ_i | O^B),  (2)

where π^{BME}_{-i} = ∏_{j∈N\{i}} π^{BME}_j;

(ii) Consistency:

μ_i(θ_{-i} | s_t, a_{-i,t}) = π^{BME}_{-i}(a_{-i,t} | s_t, θ_{-i}) d_{θ,-i}(θ_{-i}) / ∑_{s'_t, θ'_{-i,t}} π^{BME}_{-i}(a_{-i,t} | s'_t, θ'_{-i}) d_{θ,-i}(θ'_{-i,t}).  (3)

In a BME, the optimality condition (2) says that any deviation by agent i from the equilibrium π^{BME} is not profitable (given that all other agents play the equilibrium policy π^{BME}_{-i}). The consistency condition (3) requires that in a BME each agent i's belief about the other agents' types θ_{-i} be consistent with the policies played by the other agents.

Another solution concept for M^B is the Bayesian Markov correlated equilibrium, which generalizes the BME in that the equilibrium policy profile π is a correlated function. Given (s_t, a_t, θ_i), define agent i's interim expected payoff (discounted by 0 < γ ≤ 1):

J_{i,t}(s_t, a_t, θ_i; π | O^B) ≡ E^{μ_i}_π[ ∑_{τ≥t} γ^τ R_i(s_τ, a_τ | θ_i) | O^B ].  (4)

Definition 0.6 (BMCE). A profile ⟨π^{BMCE}, μ⟩ constitutes a Bayesian Markov correlated equilibrium (BMCE) if the following hold for all i ∈ N:

(i) Optimality: for any s_t ∈ S, a'_{i,t} ∈ A_i, θ_i ∈ Θ_i,

E^{μ_i}_{a_{-i,t} ∼ π^{BMCE}_{-i}}[ J_{i,t}(s_t, a_{i,t}, a_{-i,t}, θ_i; π^{BMCE} | O) ] ≥ E^{μ_i}_{a_{-i,t} ∼ π^{BMCE}_{-i}}[ J_{i,t}(s_t, a'_{i,t}, a_{-i,t}, θ_i; π^{BMCE} | O) ],  (5)

where ∑_{a'_{-i,t}, θ'_{-i}} π^{BMCE}(a_{i,t}, a'_{-i,t} | s_t, θ_i, θ'_{-i}) > 0.
(ii) Consistency:

μ_i(θ_{-i} | s_t, a_{-i,t}) = π^{BMCE}_{-i}(a_{-i,t} | s_t, θ_{-i}) d_{θ,-i}(θ_{-i}) / ∑_{s'_t, θ'_{-i,t}} π^{BMCE}_{-i}(a_{-i,t} | s'_t, θ'_{-i}) d_{θ,-i}(θ'_{-i,t}),  (6)

where π^{BMCE}_{-i}(a_{-i,t} | s_t, θ_{-i}) = ∑_{a'_{i,t}, θ'_i} π^{BMCE}(a'_{i,t}, a_{-i,t} | s_t, θ'_i, θ_{-i}).

In a BMCE, the agents can coordinate their actions to achieve higher expected payoffs. Conceptually, we can imagine that there is a coordinator that uses π^{BMCE} to provide an action recommendation a_{i,t}, specified by π_i(a_{i,t} | s_t, θ_{i,t}) = ∑_{a'_{-i,t}, θ'_{-i}} π^{BMCE}(a_{i,t}, a'_{-i,t} | s_t, θ_i, θ'_{-i}), to each agent i with belief μ_i about θ_{-i}, who knows the distribution of the other agents' actions through π^{BMCE}_{-i}(a_{-i,t} | s_t, θ_{-i}). To be an equilibrium, π^{BMCE} is required to incentivize each agent i to take the recommended action instead of deviating to another action. However, as with a standard correlated equilibrium, this coordinator is not required to achieve a BMCE as long as there is a public communication mechanism, e.g., publicly observed information [41, 45].

Consider a discrete-time n-agent infinite-horizon game M that extends a canonical Bayesian Markov game M^B by expanding the information structure with additional payoff-relevant information, referred to as signals. We refer to M as the augmented Bayesian Markov game (A-BMG, augmented game). We consider an environment in which there are m sources of signals, denoted by K ≡ [m]. Each source l ∈ K sends one signal ω^l_{i,t} to each agent i at each period t of the game. We refer to the commonly observed state as the global state (state) g_t. Hence, besides the state g_t and his private type θ_i, each agent i privately observes a group of signals, denoted by W_{i,t} ≡ {ω^l_{i,t}}_{l∈K}. However, each agent i selects only one signal ω_{i,t} from the group W_{i,t}. The game M is defined as a tuple

M ≡ ⟨N, K, A, O, {R_i}_{i∈N}⟩.
(7)

Here, A is a finite set of actions from which each agent can choose; let A ≡ A^n denote the set of joint actions and A_{-i} ≡ A^{n-1} the set of joint actions of agents other than i. O ≡ ⟨G, Ω^m, Θ, T_g, P, d_g, d_θ⟩ is the global information structure, where G is a finite set of global states (states); Ω is a finite set of signals that each source in K can send; Θ ≡ Θ^n, where Θ is a finite set of types and each agent i's type θ_i is privately observed by agent i (we assume that the agents have the same set of types); T_g : G × A → ∆(G) is the transition function of the state, such that T_g(g_{t+1} | g_t, a_t) specifies the probability of the next state g_{t+1} when the current global state is g_t and the current joint actions are a_t; P is the probability measure of the signals W_t ≡ {W_{i,t}}_{i∈N} received by the agents as additional information; d_g is the initial distribution of the state; and d_θ is the prior distribution of each agent's type. After receiving W_{i,t}, agent i selects one signal ω_{i,t} from W_{i,t}. R_i : G × A × Ω × Θ → R is the reward function that maps the joint actions a_t ∈ A, the global state g_t, agent i's selected signal ω_{i,t}, and his type θ_i into a scalar reward R_i(a_t, g_t, ω_{i,t} | θ_i).

Each agent i is rational in the sense that he is self-interested and makes his decisions according to his observation o_i = (g_t, W_{i,t}, θ_i | O) to maximize his expected payoff. Here, each agent i privately observes his type θ_i and signals W_{i,t}; hence, only the global state g_t is common information among the observations of the agents. Given the observation o_i = (g_t, W_{i,t}, θ_i | O), the decision making of each agent i consists of two processes: (i) selecting one signal ω_{i,t} from W_{i,t}, and (ii) choosing an action a_{i,t} from A. The solution to the game M is a profile ⟨β, π, μ⟩. Here, β : G × Ω^{m×n} × Θ → Ω is a selection strategy profile, such that β(g_t, W_t, θ) specifies the choices of
signals ω_t ≡ {ω_{i,t}}_{i∈N} for each observation profile o = (g_t, W_t, θ | O), and π : G × Ω^n × Θ → ∆(A) is a policy profile, such that π(a_t | g_t, ω_t, θ) specifies the distribution of the next joint actions for each state g_t, choice of joint signals ω_t, and joint types θ. The profiles β and π can be either correlated (i.e., a joint function) or independent (i.e., ω_{i,t} = β_i(g_t, W_{i,t}, θ_i) for all i ∈ N, and π = ∏_{i∈N} π_i). The solution of the augmented game M also requires a belief system μ = {μ_i}_{i∈N}, where μ_i : G × Ω × Θ → ∆(Ω^{n-1} × Θ^{n-1}) describes each agent i's belief about the unobserved signals W_{-i} and the unobserved types θ_{-i} of the other agents.

Given any observation o_i = (g_t, W_{i,t}, θ_i | O), each agent i's selection of the signal and choice of action are fundamentally different. Specifically, agent i first uses β_i to select the signal ω_{i,t} = β_i(g_t, W_{i,t}, θ_i) and then chooses an action a_{i,t} according to π_i(a_{i,t} | g_t, ω_{i,t}, θ_i) (supposing we consider a Nash equilibrium here) based on the realized selection ω_{i,t}. The transition of the global state is controlled by the current g_t and the realized actions a_t, i.e., T_g(g_{t+1} | g_t, a_t); however, it is independent of the selected signals ω_{i,t}, for all i ∈ N, given g_t and a_t.

In this work, we are interested in the setting where there is one rational information designer, referred to as the principal (she, indexed by k), who controls one of the m sources of signals. The principal privately sends a signal ω^k_{i,t} to each agent i such that ω^k_t is distributed according to some probability measure P^k ∈ ∆(Ω^n). We assume that W^{-k}_t ≡ W_t \ {ω^k_t} is distributed according to some fixed P^{-k}.
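The coupling between a selection rule β and a policy π can be sketched in a schematic single-agent, single-period form (our own simplification of the A-BMG, not its equilibrium objects): each signal in the batch is summarized by the posterior it induces over a binary state, a myopic selection rule picks the signal whose posterior supports the highest attainable expected reward, and the action is then chosen under that posterior. The rewards, posteriors, and the myopic rule itself are illustrative assumptions.

```python
# Schematic two-stage decision: select one signal from the batch (beta),
# then act on the selected signal (pi). The state g is binary; signal l
# is summarized by the posterior q_l = P(g = 1 | omega_l) it induces.

R = {0: {0: 1.0, 1: 0.0},   # R[a][g]: action a pays 1 iff it matches g
     1: {0: 0.0, 1: 1.0}}

def expected_reward(a, q1):
    """Expected reward of action a under posterior P(g = 1) = q1."""
    return (1 - q1) * R[a][0] + q1 * R[a][1]

def beta(batch):
    """Myopic selection rule: pick the signal whose posterior supports
    the highest achievable expected reward."""
    return max(range(len(batch)),
               key=lambda l: max(expected_reward(a, batch[l]) for a in R))

def pi(q1):
    """Policy: best action under the selected signal's posterior."""
    return max(R, key=lambda a: expected_reward(a, q1))

# Batch of realized posteriors from three sources (source 1 is sharpest).
batch = [0.5, 0.9, 0.6]
selected = beta(batch)        # index of the chosen signal
action = pi(batch[selected])  # action taken given that signal
```

In the paper's setting the selected signal also enters the reward directly and the agents interact, so β must additionally anticipate the other agents' selections and actions; this sketch only isolates the select-then-act structure that couples β and π.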
We consider that the principal is rational in the sense that she strategically chooses P^k, which governs the realizations of her signals ω^k_t at each period t, thereby influencing the probability measure P of W_t = {ω^k_t, W^{-k}_t}, such that the equilibrium behaviors of the agents coincide with the principal's desired equilibrium. This process is information design:

Definition 0.7 (Information Design Problem). An information design problem is defined as a tuple
I ≡
(vi) immediate rewards are realized and the state $g_t$ transitions to $g_{t+1}$ according to $T_g$.

Given an $\alpha$, we re-write the global information structure of $M[\alpha]$ as $\mathcal{O}_\alpha = \langle\mathcal{G},\Omega^m,\Theta,T_g,P^{-k},\alpha,d_g,d_\theta\rangle$ and an observation as $o = (g_t,\omega^k_{i,t},W^{-k}_{i,t},\theta_t \,|\, \mathcal{O})$. Here, we assume that $\mathcal{O}_\alpha$ is common knowledge (i.e., a common prior) and only the realizations of $\theta_i$, $\omega^k_{i,t}$, and $W^{-k}_{i,t}$, for $t \geq 0$, are unobserved by agents other than $i$. Hence, each agent $i$ is a Bayesian decision maker.

According to the Ionescu-Tulcea theorem (see, e.g., Hernández-Lerma and Lasserre [46]), the initial distribution $d_g$ on $g_0$, the transition function $T_g$, the signaling rule profile $\alpha$ and distribution $P^{-k}$, the selection rule profile $\beta$, and the policy profile $\pi$ define a unique probability measure $P^{\beta,\alpha}_\pi$ on $(\mathcal{G}\times\Omega^{m\times n}\times\mathcal{A})^\infty$. Given the belief $\mu_i$, the expectation with respect to $P^{\beta,\alpha}_\pi$ is denoted by $E^{\beta,\alpha;\mu_i}_\pi[\cdot]$ or $E^{\beta,\alpha;\mu_i}_\pi[\cdot\,|\,\cdot]$. Since the global information structure $\mathcal{O}_\alpha$ is fixed except for the signaling rule $\alpha$, we show $\alpha$ instead of $\mathcal{O}_\alpha$ in the notation of the expectation. Each agent $i$'s decision making is governed by its (discounted, $0<\gamma\leq 1$) cumulative expected reward (expected reward):

$$\text{ExpR}^{\pi,\beta,\alpha;\mu_i}_i(a_t,g_t,\omega_{i,t};\{\omega^k_{i,t},W^{-k}_{i,t}\}\,|\,\theta_i) \equiv \sum_{\theta_{-i}} E^{\beta,\alpha;\mu_i}_\pi\Big[\sum_{\tau\geq t}\gamma^\tau R_i(\tilde a_\tau,\tilde g_\tau,\tilde\omega_{i,\tau}|\theta_i)\,\Big|\, g_t,W_{i,t},\theta_{-i}\Big]\, d_\theta(\theta_{-i}), \tag{8}$$

where $\omega_{i,t} = \beta_i(g_t,W_{i,t},\theta_i)$, the period-$\tau$ selected signal $\tilde\omega_{i,\tau}$ is a random variable whose distribution is determined by $\alpha_i$, $P^{-k}$, and $\beta_i$, and $d_\theta(\theta_{-i}) = \prod_{j\neq i} d_\theta(\theta_j)$. Agents choose $\beta$ and $\pi$ by maximizing the expected reward (8).
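The discounted cumulative reward in (8) can be approximated numerically. The following is a minimal Monte Carlo sketch under simplifying assumptions: a single agent, types and signals suppressed, and hypothetical `step`/`reward` functions standing in for the transition kernel and reward; it is an illustration of the discounting, not the paper's full expectation over beliefs.

```python
import random

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for one realized trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_expected_reward(step, reward, g0, gamma=0.9, horizon=50, runs=2000, seed=0):
    """Monte Carlo estimate of E[sum_t gamma^t R(g_t)], truncated at `horizon`.
    `step(g, rng)` samples the next state; `reward(g)` is the immediate reward.
    Both are hypothetical stand-ins for T_g and R_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        g, rewards = g0, []
        for _ in range(horizon):
            rewards.append(reward(g))
            g = step(g, rng)  # transition depends only on the state here
        total += discounted_return(rewards, gamma)
    return total / runs
```

For a self-looping state with unit reward and $\gamma=0.5$, the estimate approaches the geometric series value $1/(1-\gamma)=2$.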
For notational simplicity, we will drop $W^{-k}_{i,t}$ from $W_{i,t}$ and only show $\omega^k_{i,t}$, unless otherwise stated, e.g., $\text{ExpR}^{\pi,\beta,\alpha;\mu_i}_i(a_t,g_t,\omega_{i,t};\omega^k_{i,t}|\theta_i) = \text{ExpR}^{\pi,\beta,\alpha;\mu_i}_i(a_t,g_t,\omega_{i,t};\{\omega^k_{i,t},W^{-k}_{i,t}\}|\theta_i)$.

As in a standard Markov game, each agent $i$'s decision when choosing an action $a_{i,t}$ takes into account the other agents' decisions when choosing $a_{-i,t}$, because its immediate reward from taking $a_{i,t}$ directly depends on $a_{-i,t}$. In $M[\alpha]$, agent $i$'s choices of $\beta_i$ and $\pi_i$ are coupled because the signal $\omega_{i,t}$ specified by $\beta_i$ has a direct causal effect on $a_{i,t}$ through $\pi_i$. Thus, the other agents' immediate rewards indirectly depend on each individual agent's selected signal through his action. Hence, the agents' strategic interactions in $M[\alpha]$ consist of selecting signals by $\beta$ and taking actions by $\pi$. Since $P^{-k}$ is fixed, the principal's choice of $\alpha$ controls the dynamics of $W_t$. Therefore, it is possible for the principal to influence the equilibrium behaviors of the agents in the game $M[\alpha]$ through a proper design of $\alpha$.

The principal's information design problem $\mathcal{I}$ is a mechanism design problem that takes an objective-first approach to designing the information structure of the signals sent to the agents, toward a desired objective $\kappa$, in a strategic setting through the design of $\alpha$, where self-interested agents act rationally by choosing $\beta$ and $\pi$. The choice of $\alpha$ is independent of the realizations of the states and the agents' types. The key restriction on the principal's $\alpha$ is that the agents are elicited to perform equilibrium behaviors $\pi$ that coincide with the principal's desired equilibrium $\kappa$. This is captured by a notion of implementability:

Definition 0.8 (Implementability).
Given $\kappa$, $M$, and $P^{-k}$, the signaling rule profile $\alpha$ is implementable if it elicits a strategy profile $\langle\beta^*,\pi^*\rangle$ that is an equilibrium of $M[\alpha]$ with an admissible $\pi^*$, i.e., for all $i\in\mathcal{N}$, $t\in\mathcal{T}$, $(g_t,\theta_i,\theta_{-i})\in\mathcal{G}\times\Theta^n$,

$$\sum_{\omega^k_t, W^{-k}_t} \pi^*\big(a_t\,\big|\,g_t,\beta^*(g_t,\omega^k_t,\theta),\theta\big)\,\alpha(\omega^k_t|g_t,\theta)\,P^{-k}(W^{-k}_t) = \kappa(a_t|g_t,\theta). \tag{9}$$

In this case, we say that the agents' equilibrium $\langle\beta^*,\pi^*\rangle$ implements $\alpha$.
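The implementability condition (9) is a finite check once the primitives are tabulated. Below is a minimal sketch assuming a single type, the other signal sources $P^{-k}$ suppressed, and hypothetical dict-of-dicts encodings of $\pi$, $\alpha$, and $\kappa$; `beta` is a function mapping a state and the principal's signal to the selected signal.

```python
def induced_action_distribution(pi, beta, alpha, g, signals, actions):
    """Left-hand side of (9) with P^{-k} suppressed: marginalize the
    principal's signaling rule alpha over the selection rule beta and
    the policy pi to get the induced distribution over joint actions."""
    out = {a: 0.0 for a in actions}
    for wk in signals:
        w = beta(g, wk)  # signal the agents actually select
        for a in actions:
            out[a] += pi[g][w][a] * alpha[g][wk]
    return out

def implements(pi, beta, alpha, kappa, states, signals, actions, tol=1e-9):
    """True when the induced action distribution matches the goal kappa
    in every state (the implementability condition of Definition 0.8)."""
    return all(
        abs(induced_action_distribution(pi, beta, alpha, g, signals, actions)[a]
            - kappa[g][a]) <= tol
        for g in states for a in actions)
```

For instance, a uniform signaling rule paired with a signal-obedient deterministic policy implements the uniform goal over two actions.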
Given any $P^{-k}$, the distribution of $a$ conditioned on any state $g$ is jointly determined by the agents' $\beta$ and the principal's $\alpha$. Hence, given $P^{-k}$, the signal $\omega^k_t$ sent by the principal using $\alpha$ ultimately influences each agent's expected reward. However, this information is transmitted indirectly through the agents' selection rules, i.e., $\omega_t = \beta(g_t,\omega^k_t|\theta)$, where $\omega_t$ is not necessarily equal to $\omega^k_t$. Therefore, the information design problem $\mathcal{I}$ that aims to find an implementable $\alpha$ leads to an indirect information design (IID). We call the game $M^{-D}[\alpha]$ the indirect augmented game.

As a designer, the principal takes into consideration each agent's decision making at every stage of the game. Hence, in any indirect augmented game $M^{-D}[\alpha]$, the principal's design of $\alpha$ must predict the possible equilibrium selection rule profiles $\beta$ and the corresponding equilibria $\pi$ that might be indirectly induced by $\alpha$. In contrast to $M^{-D}[\alpha]$, the principal may elect to use a direct information design. In a direct information design, the principal manipulates the agents' equilibrium by directly shaping their policies, when each agent selects the signal sent by the principal in each state.

Definition 0.9 (Direct Information Design). A direct information design induces a direct augmented game $M^{D}[\alpha]$, in which the agents select the $\omega^k_t$ sent by $\alpha^D$ at every period, i.e., $\beta(g_t,\{\omega^k_t,W^{-k}_t\},\theta) = \omega^k_t$, for all $g_t\in\mathcal{G}$, $\{\omega^k_t,W^{-k}_t\}\in\Omega^{m\times n}$, $\theta\in\Theta$, $t\in\mathcal{T}$.

In an $M^{D}[\alpha]$, the principal wants $\omega^k_t$ to directly enter the immediate rewards of the agents at each period $t$. The key constraint on the $\mathcal{I}$ of $M^{D}[\alpha]$ is a notion of obedient implementability, which requires the design of $\alpha$ to be such that (i) the agents prefer to select the signal $\omega^k_t$ sent by the principal rather than any other signal from $W^{-k}_t$, for each $t\in\mathcal{T}$, and (ii) the agents take the actions specified by the admissible policy rather than other available actions.

Definition 0.10 (OIL).
Given $\kappa$, the signaling rule $\alpha^{OIL}$ is obedient-implementable (OIL) if it induces $\langle\beta^O,\pi^O\rangle$ such that

(i) $\beta^O$ is obedient, i.e., for all $i\in\mathcal{N}$, $t\in\mathcal{T}$, $(g_t,\theta_i,\{\omega^k_{i,t},W^{-k}_{i,t}\})\in\mathcal{G}\times\Theta\times\Omega^m$,

$$\beta^O(g_t,\{\omega^k_t,W^{-k}_t\},\theta) = \omega^k_t; \tag{10}$$

(ii) $\pi^O$ is admissible, i.e., $\pi^O$ is an equilibrium and, for all $i\in\mathcal{N}$, $t\in\mathcal{T}$, $(g,\theta_i,\theta_{-i})\in\mathcal{G}\times\Theta^n$,

$$\sum_{\omega^k}\pi^O(a|g,\omega^k,\theta)\,\alpha^{OIL}(\omega^k|g,\theta) = \kappa(a|g,\theta). \tag{11}$$

Next, we introduce the obedient perfect Bayesian Markov Nash equilibrium (O-PBME) as the equilibrium solution concept for the direct augmented game $M^D[\alpha]$. Given $\beta$ with $\omega_{i,t} = \beta_i(g_t,\omega^k_{i,t}|\theta_i)$ and $\omega_{-i,t} = \beta_{-i}(g_t,\omega^k_{-i,t}|\theta_{-i})$ and each agent $i$'s belief $\mu_i$, define

$$\pi_{[-i]}(a_{-i,t}|g_t,\omega_i,\theta_i;\mu_i,\beta) \equiv \sum_{\omega^k_{-i,t},\theta_{-i}} \prod_{j\neq i}\pi_j(a_{j,t}|g_t,\beta_j(g_t,\omega^k_{j,t},\theta_j),\theta_j)\,\mu_i(\omega^k_{-i,t},\theta_{-i}|g_t,\omega^k_t)\,d_\theta(\theta_{-i}), \tag{12}$$

such that $\pi_{[-i]}$ is the joint policy of all other agents as perceived by agent $i$ given $\mu_i$ and $\beta$, which is independent of $a_{i,t}$. The optimality criterion of each agent $i$ is captured by his expected payoff:

$$\text{ExE}^{\pi,\beta,\alpha;\mu_i}_i(a_{i,t};g_t,\omega_{i,t};\omega^k_{i,t}|\theta_i) \equiv E_{\tilde a_{-i,t}\sim\pi_{[-i]}(\cdot|g_t,\omega_i,\theta_i;\mu_i,\beta)}\Big[\text{ExpR}^{\pi,\beta,\alpha;\mu_i}_i(a_{i,t},\tilde a_{-i,t},g_t,\omega_{i,t};\omega^k_{i,t}|\theta_i)\Big]. \tag{13}$$
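The perceived opponent policy in (12) is a belief-weighted average of the opponent's signal-contingent policy. A minimal two-agent sketch, with types suppressed and a hypothetical dict encoding of the belief $\mu_i$ over the opponent's signal:

```python
def perceived_opponent_policy(pi_others, belief, actions_others):
    """Toy version of pi_{[-i]} in (12): average the opponent's
    signal-contingent policy pi_others[w_j][a] over agent i's belief
    belief[w_j] about the opponent's (selected) signal."""
    out = {a: 0.0 for a in actions_others}
    for wj, p in belief.items():
        for a in actions_others:
            out[a] += p * pi_others[wj][a]
    return out
```

With a deterministic signal-obedient opponent policy, the perceived action distribution simply mirrors the belief over the opponent's signal.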
The belief system $\mu$ is consistent if it is updated according to Bayes' rule:

$$\mu_i(\omega_{-i,t},\theta_{-i}|g_t,W_{i,t},\theta_i;\beta_{-i}) = \frac{\alpha(\omega^k_{i,t},\omega^k_{-i,t}|g_t,\{\theta_i,\theta_{-i}\})\,P^{-k}(W^{-k}_{-i,t})\,d_\theta(\theta_{-i})}{\sum_{\hat W_{-i,t},\hat\theta_{-i}}\alpha(\omega^k_{i,t},\hat\omega^k_{-i,t}|g_t,\{\theta_i,\hat\theta_{-i}\})\,P^{-k}(\hat W^{-k}_{-i,t})\,d_\theta(\hat\theta_{-i})}. \tag{14}$$

When the denominator of (14) is zero, agent $i$ may set any probabilistic belief about $\langle\omega_{-i,t},\theta_{-i}\rangle$. We formally define the O-PBME as follows.

Definition 0.11 (O-PBME). A profile $\langle\beta^*,\pi^*\rangle$ with $\mu$ constitutes a PBME if the belief $\mu$ is updated according to (14) and the policy profile is independent, i.e., $\pi^*(a_t|g_t,\omega_t,\theta) = \prod_{i\in\mathcal{N}}\pi^*_i(a_{i,t}|g_t,\omega_{i,t},\theta_i)$, such that, for any $g_t\in\mathcal{G}$, $\theta_i\in\Theta$, $\omega^k_{i,t}\in\Omega$ with $\alpha_i(\omega^k_{i,t}|g,\theta_i)>0$, $\omega_{i,t}=\beta_i(g_t,\omega^k_{i,t},\theta_i)$, $\omega'_{i,t}\in\Omega$, $a_{i,t}\in\mathcal{A}$ with $\pi^*_i(a_{i,t}|g_t,\omega_{i,t},\theta_i)>0$, $a'_{i,t}\in\mathcal{A}$, $i\in\mathcal{N}$,

$$\text{ExE}^{\pi^*,\beta^*,\alpha;\mu_i}_i(a_{i,t};g_t,\omega_{i,t};\omega^k_{i,t}|\theta_i) \geq \text{ExE}^{\pi^*,\beta,\alpha;\mu_i}_i(a'_{i,t};g_t,\omega'_{i,t};\omega^k_{i,t}|\theta_i). \tag{15}$$

A PBME profile is an O-PBME if the selection rule profile $\beta^*$ is obedient. We denote the O-PBME equilibrium profile as $\langle\beta^O,\pi^O\rangle$ with $\mu$ and refer to the signaling rule as OIL in O-PBME (OIL-P), denoted $\alpha^{OIL}$, if it induces an O-PBME in which the equilibrium policy profile $\pi^O$ is admissible, denoted $\pi^{AO}$. We call $\langle\beta^O,\pi^{AO}\rangle$ an admissible O-PBME.

A successful information design depends on the principal having accurate beliefs about the agents' decision processes. This includes all possible indirect selection behaviors of the agents, i.e., all possible $\beta\neq\beta^O$. The point of direct information design is that it allows the principal to forgo analyzing all of the agents' indirect selection behaviors and to focus on the obedient $\beta^O$. This is formalized by the obedience principle:

Theorem 1 (Obedience Principle).
Let $\langle\beta^*,\pi^*\rangle$ with $\mu$ implement an indirect $\alpha^{-D}$ in a PBME that achieves a goal $\kappa$. Then, there exists a direct information design with an OIL-implementable $\alpha^{OIL}$ in PBME that induces an equilibrium $\langle\beta^O,\pi^{AO}\rangle$ of the game $M[\alpha^{OIL}]$ achieving the same goal $\kappa$.

The obedience principle shows that for any indirect information design that achieves a goal, there exists a direct information design that leads to the same goal. Hence, it is without loss of generality for the principal to focus on direct information design, in which the agents' obedient selection strategy $\beta^O$ is straightforward.

In this section, we characterize the OIL-P of the signaling rule $\alpha^{OIL}$ that constrains the principal's information design problem. Since we restrict attention to stationary equilibrium strategies, we omit the time index unless otherwise stated.

Given a game $M[\alpha]$ for some $\alpha$, let $T^{\pi,\beta,\alpha}_{g'g}$ denote the transition probability from state $g$ to state $g'$, given that the agents choose actions according to $\pi$ and select signals according to $\beta$, with a slight abuse of notation:

$$T^{\pi,\beta,\alpha}_{g'g} \equiv \sum_{a,\omega^k,\omega}\pi(a|g,\omega,\theta)\,\alpha(\omega^k|g,\theta)\,T_g(g'|g,a).$$
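The induced transition kernel above can be computed by direct marginalization. A minimal sketch with a single type, dict-of-dicts encodings of $T_g$, $\alpha$, and $\pi$ (all hypothetical names), and `beta` given as a function:

```python
def induced_transition(T, alpha, beta, pi, g, states, signals, actions):
    """Induced state transition T^{pi,beta,alpha}_{g' g}: marginalize the
    signaling rule, the selection rule, and the policy.
    T[g][a][g'] is the controlled kernel T_g(g'|g,a); types are suppressed."""
    out = {gp: 0.0 for gp in states}
    for wk in signals:
        w = beta(g, wk)  # under an obedient beta this is wk itself
        for a in actions:
            weight = alpha[g][wk] * pi[g][w][a]
            for gp in states:
                out[gp] += weight * T[g][a][gp]
    return out
```

With a degenerate signaling rule and a deterministic policy, the induced kernel reduces to the corresponding row of $T_g$.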
Given $\alpha$, $\beta$, $\pi$, and the belief system $\mu$, define the state-signal value function $V^{\pi,\beta,\alpha;\mu_i}_i$ of agent $i$, representing agent $i$'s expected reward originating at $(g,\omega^k_i,\omega_i)\in\mathcal{G}\times\Omega\times\Omega$ with $\omega_i=\beta_i(g,\omega^k_i|\theta_i)$:

$$V^{\pi,\beta,\alpha;\mu_i}_i(g,\omega_i;\omega^k_i|\theta_i) \equiv E^{\beta;\mu_i}\Big[\sum_{t=0}^{\infty}\sum_{g'}\gamma^t\big(T^{\pi,\beta,\alpha}_{g'g}\big)^t\sum_{a}\pi(a|g,\omega;\theta)\,\alpha(\omega^k|g,\theta)\,R_i(a,g,\omega_i|\theta_i)\Big]. \tag{16}$$

Define the state value function $J^{\pi,\beta,\alpha;\mu_i}_i$ of agent $i$, which describes his expected reward originating at state $g\in\mathcal{G}$:

$$J^{\pi,\beta,\alpha;\mu_i}_i(g|\theta_i) \equiv E^{\beta;\mu_i}\Big[\sum_{t=0}^{\infty}\sum_{g'}\gamma^t\big(T^{\pi,\beta,\alpha}_{g'g}\big)^t\sum_{a,\omega^k,\omega}\pi(a|g,\omega;\theta)\,\alpha(\omega^k|g,\theta)\,R_i(a,g,\omega_i|\theta_i)\Big]. \tag{17}$$

Next, define the $Q$-value function (the state-signal-action value function) $Q^{\pi,\beta,\alpha;\mu_i}_i$, which represents agent $i$'s expected reward if $(\omega,a)\in\Omega^n\times\mathcal{A}$ are played in $(g,\omega^k)\in\mathcal{G}\times\Omega^n$:

$$Q^{\pi,\beta,\alpha;\mu_i}_i(g,\omega_i,a;\omega^k_i|\theta_i) \equiv E^{\beta;\mu_i}\Big[R_i(a,g,\omega_i|\theta_i) + \gamma\sum_{g'}T_g(g'|g,a)\Big(\sum_{t=0}^{\infty}\sum_{g''}\gamma^t\big(T^{\pi,\beta,\alpha}_{g''g'}\big)^t\sum_{a',\omega^{k\prime},\omega'}\pi(a'|g'',\omega';\theta)\,R_i(a',g'',\omega'_i|\theta_i)\Big)\Big]. \tag{18}$$

For simplicity, we will combine the agent's obedient selection and the signal sent by the principal for the rest of the paper unless otherwise stated, e.g., $V^{\pi,\beta,\alpha;\mu_i}_i(g,\omega^k_i|\theta_i) = V^{\pi,\beta,\alpha;\mu_i}_i(g,\omega^k_i;\omega^k_i|\theta_i)$. The following proposition, stated without proof, is an analog of Bellman's theorem [47].

Proposition 1.1.
Given a game $M[\alpha]$ with any stationary $\alpha$, stationary $\langle\beta,\pi\rangle$, and $\mu$, for any $V_i:\mathcal{G}\times\Omega\times\Omega\mapsto\mathbb{R}$, $J_i:\mathcal{G}\mapsto\mathbb{R}$, and $Q_i:\mathcal{G}\times\Omega\times\mathcal{A}\times\Omega\mapsto\mathbb{R}$, we have $V_i = V^{\pi,\beta,\alpha;\mu_i}_i$, $J^\alpha_i = J^{\pi,\beta,\alpha;\mu_i}_i$, and $Q_i = Q^{\pi,\beta,\alpha;\mu_i}_i$ if and only if:

$$V_i(g,\omega_i;\omega^k_i|\theta_i) = E^{\beta;\mu_i}\Big[\sum_{a}\pi(a|g,\omega)\,Q_i(g,\omega_i,a;\omega^k_i|\theta_i)\Big], \tag{19}$$

$$J^\alpha_i(g|\theta_i) = E^{\beta;\mu_i}\Big[\sum_{\omega_i,\omega^k_i}\alpha_i(\omega^k_i|g,\theta)\,V_i(g,\omega_i;\omega^k_i|\theta_i)\Big], \tag{20}$$

$$Q_i(g,\omega_i,a;\omega^k_i|\theta_i) = R_i(a,g,\omega_i|\theta_i) + E^{\beta;\mu_i}\Big[\gamma\sum_{g',\omega'_i,\omega^{k\prime}_i}T_g(g'|g,a)\,\alpha_i(\omega^{k\prime}_i|g',\theta)\,V_i(g',\omega'_i;\omega^{k\prime}_i|\theta_i)\Big]. \tag{21}$$

From Proposition 1.1, we can reformulate $V^{\pi,\beta,\alpha;\mu_i}_i$, $J^{\pi,\beta,\alpha;\mu_i}_i$, and $Q^{\pi,\beta,\alpha;\mu_i}_i$ given in (16)-(18) recursively such that (19)-(21) are satisfied. The following proposition characterizes any PBME $\langle\beta,\pi\rangle$ in $M[\alpha]$ for any $\alpha$.

Proposition 1.2.
In any $M[\alpha]$, a stationary strategy profile $\langle\beta,\pi\rangle$ with $\mu$ is a PBME if and only if, for all $i\in\mathcal{N}$, $\omega^k\in\Omega$ with $\alpha_i(\omega^k|g,\theta)>0$, $(\omega_i,a_i)\in\Omega\times\mathcal{A}$ with $\omega_i=\beta_i(g,\omega^k_i|\theta_i)$ and $\pi_i(a_i|g,\omega_i)>0$, $\omega'_i\in\Omega$, $a'_i\in\mathcal{A}$,

$$E_{a_{-i}\sim\pi_{-i}}\Big[Q^{\pi,\beta,\alpha;\mu_i}_i(g,\omega_i,a_i,a_{-i};\omega^k_i|\theta_i)\Big] \geq E_{a_{-i}\sim\pi_{-i}}\Big[Q^{\pi,\beta,\alpha;\mu_i}_i(g,\omega'_i,a'_i,a_{-i};\omega^k_i|\theta_i)\Big]. \tag{22}$$

Proposition 1.2 establishes a one-shot deviation principle. In particular, if it is optimal for agent $i$ to follow the equilibrium selection rule $\beta_i$ and policy $\pi_i$ at a given observation $o_i$, when his behavior in the situations that $o_i$ has transitioned from and will transition to follows the equilibrium, then it is also optimal for him to follow the equilibrium even if he has deviated in the past or will deviate in future situations. This implies that we can restrict the characterization of the equilibrium to its robustness to one-shot deviations. We have the following lemma.
Lemma 2.
In a stationary PBME $\langle\beta^*,\pi^*\rangle$ of a game $M[\alpha]$, the following holds: for all $i\in\mathcal{N}$, $\omega^k\in\Omega$ with $\alpha_i(\omega^k|g,\theta)>0$, $(\omega_i,a_i)\in\Omega\times\mathcal{A}$ with $\omega_i=\beta_i(g,\omega^k_i,\theta_i)$ and $\pi_i(a_i|g,\omega_i)>0$, $\omega'_i\in\Omega$, $a'_i\in\mathcal{A}$,

$$V^{\pi,\beta,\alpha;\mu_i}_i(g,\omega_i;\omega^k|\theta_i) \geq E^{\beta;\mu_i}_{a_{-i}\sim\pi_{-i}}\Big[Q^{\pi,\beta,\alpha;\mu_i}_i(g,\omega'_i,a'_i,a_{-i};\omega^k_i|\theta_i)\Big], \tag{23}$$

$$J^{\pi,\beta,\alpha;\mu_i}_i(g|\theta_i) \geq E^{\beta;\mu_i}\Big[\sum_{\omega^k_i}\alpha_i(\omega^k|g,\theta)\,V^{\pi,\beta,\alpha;\mu_i}_i(g,\omega'_i;\omega^k|\theta_i)\Big]. \tag{24}$$

In Lemma 2, the right-hand side (RHS) of (23) is the state-signal-action value of $\langle\beta,\pi\rangle$, with the other agents' actions $a_{-i}$ averaged out, under an arbitrary deviation $(\omega'_i,a'_i)$ from $\langle\beta,\pi\rangle$. The RHS of (24) is the expected state-signal value of $\langle\beta,\pi\rangle$, with the expectation taken over agent $i$'s selected signal, under an arbitrary deviation $\omega'_i$ from $\beta$. Inequality (23) says that in a PBME each agent $i$'s $V^{\pi,\beta,\alpha;\mu_i}_i$ must be robust to any action deviation in terms of $Q^{\pi,\beta,\alpha;\mu_i}_i$, when all other agents play the equilibrium profile $\langle\beta_{-i},\pi_{-i}\rangle$; inequality (24) says that each agent $i$'s $J^{\pi,\beta,\alpha;\mu_i}_i$ must be robust to any signal-selection deviation, as captured by $V^{\pi,\beta,\alpha;\mu_i}_i$, when all other agents play the equilibrium profile $\langle\beta_{-i},\pi_{-i}\rangle$.

In an $M[\alpha^{OIL}]$, the principal designs her signaling rule $\alpha^{OIL}$ such that $\beta^O_i$ is obedient and $\pi^{AO}$ is admissible for each agent $i$. Given a $\kappa$ and an $\alpha$, define, for any $g\in\mathcal{G}$, $\theta_i\in\Theta$,

$$\bar V^{\kappa,\alpha}_i(g;V_i|\theta_i) \equiv E^{\mu_i}\Big[\sum_{\omega^k_i}V_i(g,\omega^k_i|\theta_i)\,\alpha_i(\omega^k_i|g,\theta)\Big]. \tag{25}$$

Motivated by a fundamental formulation of a Nash equilibrium as a nonlinear program (see Theorem 3.8.2 of [48]), we obtain the following theorem, which characterizes the OIL-P of the principal's information design problem.

Theorem 3.
Suppose that the principal's goal is $\kappa$. A signaling rule $\alpha^{OIL}$ is OIL-P if and only if it induces an O-PBME $\langle\beta^O,\pi^{AO}\rangle$ with $\mu_i$ and a corresponding $V^{\alpha^{OIL}}$ satisfying (19)-(21) that is the global minimum of the following constrained optimization problem with $Z^{OIL\text{-}P}(\pi^{AO},\beta^O,V^{\alpha^{OIL}};\alpha^{OIL},\kappa)=0$:

$$\min_{\pi,\beta,V}\ Z^{OIL\text{-}P}(\pi,\beta,V;\alpha^{OIL},\kappa) \equiv \sum_{i\in\mathcal{N},g,\omega_i}\bar V^{\kappa,\alpha^{OIL}}_i(g;V_i|\theta_i) - E_{\omega^k_i\sim\alpha^{OIL}_i,\,a\sim\pi}\Big[Q^{\pi,\beta,\alpha^{OIL}}_i(g,\omega_i,a;\omega^k_i|\theta_i)\Big], \tag{26}$$

such that, for all $i\in\mathcal{N}$, $g\in\mathcal{G}$, $\omega^k_i\in\Omega$ with $\alpha^{OIL}_i(\omega^k_i|g,\theta_i)>0$, $\theta_i\in\Theta$, $a_i\in\mathcal{A}$, any $\beta'_i$, and any $\pi'_i$,

$$V_i(g,\omega^k_i|\theta_i) \geq E_{a_{-i}\sim\pi_{-i}}\Big[Q^{\pi'_i,\pi_{-i},\beta,\alpha^{OIL}}_i(g,\omega^k_i,a_i,a_{-i}|\theta_i)\Big], \tag{27}$$

$$J^{\alpha^{OIL}}_i(g|\theta_i) \geq E^{\beta}\Big[\sum_{\omega^k_i}\alpha^{OIL}_i(\omega^k_i|g,\theta)\,V_i(g,\beta'_i(g,\omega^k_i,\theta_i);\omega^k_i|\theta_i)\Big], \tag{28}$$

$$\bar V^{\kappa,\alpha^{OIL}}_i(g;V_i|\theta_i) = E^{\alpha^{OIL}_i}_{\pi}\Big[Q^{\pi,\beta,\alpha^{OIL}}_i(g,\omega^k_i,a|\theta_i)\Big], \tag{29}$$

where $J^{\alpha^{OIL}}_i$ and $Q^{\pi,\beta,\alpha^{OIL}}_i$ are constructed according to (20) and (21) in terms of $V_i$.

In Theorem 3, the optimization problem has three decision variables, $\pi$, $\beta$, and $V$. It is straightforward to see that the objective function in (26) satisfies $Z^{OIL\text{-}P}(\pi^{AO},\beta^O,V^O;\alpha^{OIL},\kappa)=0$ at an O-PBME, given a principal's goal $\kappa$. Hence, at an O-PBME, only the three constraints remain active in the constrained optimization problem. The constraint (27) requires that, given the obedient $\beta^O$, each agent $i$ has no incentive to deviate from an equilibrium policy profile $\pi$ by
any arbitrary deviation $a_i$. In other words, the constraint (27) guarantees that the signaling rule $\alpha^{OIL}$ and the obedient $\beta^O$ lead the agents to a PBME. The constraint (28) requires that, for any equilibrium policy profile, the obedient $\beta^O$ is always preferred by the agents over any other selection rule. Finally, the constraint (29) requires that the equilibrium policy $\pi^{AO}$ in the PBME with obedient $\beta^O$ is admissible given $\kappa$. Together, the constraints (27), (28), and (29) guarantee that the obedient $\beta^O$ and the corresponding policy profile $\pi^{AO}$ constitute an admissible O-PBME, such that the design of $\alpha^{OIL}$ incentivizes the agents to behave according to $\langle\beta^O,\pi^{AO}\rangle$ instead of choosing any other non-obedient equilibrium $\langle\beta^*,\pi^*\rangle$ or any arbitrary deviation.

For a signaling rule $\alpha$ and a profile $\langle\beta,\pi\rangle$, we define the occupancy measure of the $\alpha$-augmented game $M[\alpha]$, denoted $\rho^{\alpha,\beta}_\pi$, as follows: for any $g\in\mathcal{G}$, $a\in\mathcal{A}$, $\omega\in\Omega^n$, $\omega^k\in\Omega^n$, $\theta\in\Theta$,

$$\rho^{\alpha,\beta}_\pi(g,a,\omega,\omega^k|\theta) \equiv \alpha(\omega^k|g,\theta)\,\rho^{\beta}_\pi(g,a,\omega|\omega^k,\theta), \tag{30}$$

where

$$\rho^{\beta}_\pi(g,a,\omega|\omega^k,\theta) \equiv \pi(a|g,\omega,\theta)\times\sum_{t=0}^{\infty}\gamma^t\,P\big(g_t=g,\ \beta(g,\omega^k,\theta)=\omega\ \big|\ \omega^k_t=\omega^k;\beta,\pi\big) \tag{31}$$

is the signal-conditioned occupancy measure. For notational compactness, we let $\rho^{\alpha,\beta}_\pi(g,a,\omega^k|\theta) = \rho^{\alpha,\beta}_\pi(g,a,\omega^k,\omega^k|\theta)$, unless otherwise stated. Given the belief system $\mu$, the occupancy measure perceived by each agent $i$ is given as

$$\rho^{\alpha,\beta;\mu_i}_{[i],\pi}(g,a,\omega_i,\omega^k_i|\theta_i) \equiv E^{\mu_i}\Big[\rho^{\alpha,\beta}_\pi(g,a,\omega,\omega^k|\theta)\Big], \tag{32}$$

with $\rho^{\alpha,\beta;\mu_i}_{[i],\pi}(g,a,\omega^k_i|\theta_i) = \rho^{\alpha,\beta;\mu_i}_{[i],\pi}(g,a,\omega^k_i,\omega^k_i|\theta_i)$. Similarly, given the principal's goal $\kappa$, we can define an occupancy measure with respect to $\kappa$: $\rho_\kappa(g,a|\theta) \equiv \sum_{t=0}^{\infty}P_\kappa(g_t=g,a_t=a|\theta)$, where $P_\kappa$ represents the probability given $\kappa$ and the transition of the global state.
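Any occupancy measure of this kind must satisfy a Bellman flow (balance) condition, which is formalized for $M[\alpha]$ in the set defined in (34). A minimal numerical check, with the signal components marginalized out so that `rho[g][a]` is a plain state-action occupancy, and all encodings hypothetical:

```python
def satisfies_flow(rho, d, T, gamma, states, actions, tol=1e-8):
    """Check the Bellman flow constraint (cf. (34), signals marginalized):
        sum_a rho[g][a] = d[g] + gamma * sum_{g',a'} rho[g'][a'] * T[g'][a'][g]
    for every state g. d is the initial distribution; T[g][a][g'] the kernel."""
    for g in states:
        lhs = sum(rho[g][a] for a in actions)
        rhs = d[g] + gamma * sum(rho[gp][ap] * T[gp][ap][g]
                                 for gp in states for ap in actions)
        if abs(lhs - rhs) > tol:
            return False
    return True
```

In a single self-looping state with $\gamma=0.5$, the unique feasible mass is the geometric total $1/(1-\gamma)=2$.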
Equivalently, the occupancy measure associated with the principal's goal $\kappa$ can be written as

$$\rho_\kappa(g,a|\theta) \equiv \kappa(a|g,\theta)\sum_{t=0}^{\infty}P(g_t=g|\theta;\kappa). \tag{33}$$

We extend the basic result known as the Bellman flow constraints of Markov decision processes (see, e.g., [49, 50, 51]) to $M[\alpha]$ and define the following set of occupancy measures:

$$\mathcal{D}[\alpha;\theta] \equiv \Big\{\rho:\ \rho\geq 0,\ \text{and}\ \sum_{a,\omega}\rho(g,a,\omega,\omega^k|\theta) = \alpha(\omega^k|g,\theta)\Big(d_g(g) + \gamma\sum_{g',a',\omega'}\rho(g',a',\omega',\omega^k|\theta)\,T_g(g|g',a')\Big)\Big\}. \tag{34}$$

We have the following corollary, which characterizes the results of Theorem 3 in terms of the occupancy measure.

Corollary 3.1.
Fix a $\kappa$. The signaling rule $\alpha^*$ is OIL-P if and only if there exists a $\rho^*\in\mathcal{D}[\alpha^*;\theta]$ that solves the following constrained optimization problem:

$$\max_{\rho\in\mathcal{D}[\alpha^*;\theta]}\ \sum_{g,a,\omega^k}\sum_{i\in\mathcal{N}}R_i(g,a,\omega^k_i|\theta_i)\,\rho(g,a,\omega^k|\theta), \tag{35}$$

such that, for any $g\in\mathcal{G}$, $\omega^k\in\Omega^n$, $a\in\mathcal{A}$, $\theta\in\Theta$,

$$\frac{\sum_{a'}\rho(g,a',\omega^k|\theta)}{\sum_{a',\omega}\rho(g,a',\omega,\omega^k|\theta)} = 1, \tag{36}$$
$$\sum_{\omega^k}\rho(g,a,\omega^k|\theta)\,\alpha(\omega^k|g,\theta) = \rho_\kappa(g,a|\theta). \tag{37}$$

For each $\rho^*$ that solves (35)-(37), the admissible O-PBME policy profile $\pi^{AO}_{\rho^*}$ is uniquely given as, for any $g\in\mathcal{G}$, $\omega^k\in\Omega^n$, $a\in\mathcal{A}$, $\theta\in\Theta$,

$$\pi^{AO}_{\rho^*}(a|g,\omega^k,\theta) = \frac{\rho^*(g,a,\omega^k|\theta)}{\sum_{a'}\rho^*(g,a',\omega^k|\theta)}. \tag{38}$$

Corollary 3.1 extends the basic linear programming formulation of Markov decision processes in terms of the occupancy measure (see, e.g., [49, 50, 51]) to the game $M[\alpha]$ in O-PBME and formulates the constrained optimization problem (26)-(29) of Theorem 3 as an occupancy measure selection problem. Choosing an occupancy measure from the set $\mathcal{D}^O[\alpha^*;\theta]$ captures the Bellman flow constraints [50] of the game $M[\alpha^*]$ when the agents use the obedient $\beta^O$. The constraint (36) requires that the feasible occupancy measures are those with obedient $\beta^O$, i.e., the probability of $\beta(g,\omega^k,\theta)=\omega^k$ is $1$. The second constraint (37) guarantees the admissibility of the optimal policy profile associated with the optimal occupancy measure $\rho^*$. Here, (38) follows from the basic result that for each occupancy measure $\rho\in\mathcal{D}^O[\alpha^*;\theta]$ there is a unique policy profile that can be constructed in terms of the occupancy measure.

Next, we extend the occupancy measure and define the $t$-sequential occupancy measure perceived by each agent $i$, denoted $\lambda^{\alpha,\beta}_{\pi,[i],t}$, which is the distribution over the length-$t$ sequences of state-action-signals that the agents encounter when using the equilibrium profile $\langle\beta,\pi\rangle$, given a signaling rule $\alpha$. Define the trajectories of length $t-\tau+1$, $x^{(t)}_{i;\tau}\equiv\{x_{i,t},x_{i,t-1},\dots,x_{i,\tau}\}$ and $x^{(t)}_{\tau}\equiv\{x_t,x_{t-1},\dots,x_\tau\}$, respectively, for the term $x$ and the joint $x=\{x_i\}_{i\in\mathcal{N}}$, for $x=g,a,\omega,\omega^k$, with $\omega^{k;(t)}_{i;\tau}$ and $\omega^{k;(t)}_{\tau}$, $i\in\mathcal{N}$. Define the sequence of length $t$ as $h_{\tau:t}[g^{(t)}_\tau,\omega^{k;(t)}_\tau,\omega^{(t)}_\tau,a^{(t)}_\tau] \equiv \{(g_\tau,\omega^k_\tau),(\omega_\tau,a_\tau,g_{\tau+1},\omega^k_{\tau+1}),\dots$
$,(\omega_{\tau+t-1},a_{\tau+t-1},g_{\tau+t},\omega^k_{\tau+t})\}\in H_t$, with $h_{\tau:t}[g^{(t)}_\tau,\omega^{k;(t)}_\tau,a^{(t)}_\tau] = h_{\tau:t}[g^{(t)}_\tau,\omega^{k;(t)}_\tau,\omega^{k;(t)}_\tau,a^{(t)}_\tau]$, where $H_t \equiv \mathcal{G}\times\Omega^n\times(\Omega^n\times\mathcal{A}\times\mathcal{G}\times\Omega^n)^t$. Similarly, we define $\ell_{\tau:t}[g^{(t)}_\tau,\omega^{k;(t)}_\tau,\omega^{(t)}_\tau]\in H^{\setminus a}_t$ as $h_t[g^{(t)},\omega^{k;(t)},\omega^{(t)},a^{(t)}]$ without the trajectory of actions $a^{(t)}$, where $H^{\setminus a}_t \equiv \mathcal{G}\times\Omega^n\times(\Omega^n\times\mathcal{G}\times\Omega^n)^t$. For notational compactness, we simplify the sequences by showing only $h_t$ and $\ell_t$ without trajectories, or by showing some specific trajectories or elements for the purpose of highlighting (e.g., a realized sequence $h_t$ can be written as $h_t[a_t]$ or $h_t[a_t,\omega^k_t]$), unless otherwise stated. We write $\ell_{\tau:\tau+t}\subset h_{\tau:\tau+t}[a_t]$ if $\ell_{\tau:\tau+t}$ is $h_{\tau:\tau+t}[a_t]$ without $a_t$. When $\tau=0$, the index $\tau$ is dropped from the notation of trajectories and sequences, e.g., $x^{(t)}=x^{(t)}_0$ and $h_t=h_{0:t}$.

Formally, the $t$-sequential occupancy measure is defined as, for any $h_t\in H_t$, $\theta\in\Theta$,

$$\lambda^{\alpha,\beta}_{\pi,t}(h_t|\theta) \equiv \sum_{\tau=0}^{\infty}P^{\alpha,\beta}_\pi\big(h_{\tau:\tau+t}=h_t\,\big|\,\theta\big). \tag{39}$$

Similarly, given any $g^{(t)}\subset\ell_t\subset h_t[a^{(t)}]\in H_t$, we define $\bar\lambda^{\alpha,\beta}_{\pi,t}(\ell_t|\theta) \equiv \sum_{a'}\lambda^{\alpha,\beta}_{\pi,t}(h_t[a']|\theta)$. For any $\tilde h_t = \{\tilde h_{i,t}\}_{i\in\mathcal{N}} = \{(\tilde g_\tau,\tilde\omega^k_{i,\tau},\tilde\omega_{i,\tau},\tilde a_\tau)\}_{i\in\mathcal{N},\tau=0}^{t}\in H_t$, we denote, for any $\tau\geq 0$, $t\geq 0$,

$$R^{(\tau+t)}_{i;\tau}(\tilde h_{i,t}|\theta_i) \equiv \sum_{s=0}^{t}\gamma^{\tau+s}R_i(g_{\tau+s}=\tilde g_s, a_{\tau+s}=\tilde a_s, \omega_{i,\tau+s}=\tilde\omega_{i,s}|\theta_i).$$

Then, the following holds:

$$E^{\alpha,\beta}_\pi\big[R_i(g',a',\omega'_i|\theta_i)\big] = \sum_{h_t\in H_t, i\in\mathcal{N}}R^{(t)}_i(h_{i,t}|\theta_i)\,\lambda^{\alpha,\beta}_{\pi,t}(h_t|\theta) = \sum_{g,a,\omega^k}\sum_{i\in\mathcal{N}}R_i(g,a,\omega^k_i|\theta_i)\,\rho^{\alpha,\beta}_\pi(g,a,\omega^k|\theta).$$
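The policy recovery in (38) is a per-state, per-signal normalization of the occupancy measure. A minimal sketch with hypothetical dict encodings and types suppressed:

```python
def policy_from_occupancy(rho, states, signals, actions):
    """Recover the unique policy of eq. (38):
        pi(a | g, wk) = rho[g][wk][a] / sum_{a'} rho[g][wk][a'].
    When a (state, signal) pair carries no mass, fall back to uniform
    (an arbitrary choice, since the pair is never visited)."""
    pi = {}
    for g in states:
        pi[g] = {}
        for wk in signals:
            mass = sum(rho[g][wk][a] for a in actions)
            pi[g][wk] = {a: (rho[g][wk][a] / mass if mass > 0 else 1.0 / len(actions))
                         for a in actions}
    return pi
```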
Given the belief $\mu_i$, we let $\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(h_{i,t}|\theta_i)$ denote the $t$-sequential occupancy measure perceived by each agent $i$: $\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(h_{i,t}|\theta_i) = E^{\mu_i}\big[\sum_{\omega^{k;(t)}_{-i},\omega^{(t)}_{-i}}\lambda^{\alpha,\beta}_{\pi,t}(h_t|\theta)\big]$. Analogously, we can define $\bar\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}$.

Let $h_{i,t}$ denote any sequence of length $t$ (perceived by agent $i$), $t>0$. For any $t'$ with $0\leq t'<t$, we write $h_{i,t} = h_{i,t'}\oplus h_{i,t'+1:t}$, such that $h_{i,t'}$ is the first $t'$ components of $h_{i,t}$ and $h_{i,t'+1:t}$ is the last $t-t'$ components of $h_{i,t}$; $\oplus$ is not symmetric in general, i.e., $h_{i,t}\oplus h'_{i,t} \neq h'_{i,t}\oplus h_{i,t}$. Given any two sequences $h_{i,t}$ and $h'_{i,t'}$ of lengths $t$ and $t'$ and any sequence $\ell_{i,t''}$ of length $t''$, the transition functions of the sequences can be formulated as follows:

$$T^{\pi,\beta,\alpha}_{h_{i,t'},h_{i,t}} = T_h(h_{i,t'}|h_{i,t};\theta_i) \equiv \frac{\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t+t'}(h_{i,t}\oplus h_{i,t'}|\theta_i)}{\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(h_{i,t}|\theta_i)},$$

$$T^{\pi,\beta,\alpha}_{\ell_{i,t''},h_{i,t}} = T_{\ell h}(\ell_{i,t''}|h_{i,t}) \equiv \frac{\sum_{a^{(t'')}}\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t+t''}(h_{i,t}\oplus\ell_{i,t''}\cup\{a^{(t'')}\}|\theta_i)}{\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(h_{i,t}|\theta_i)}.$$

Hence, $T^{\pi,\beta,\alpha}_{h_{i,t'},h_{i,t}}$ and $T^{\pi,\beta,\alpha}_{\ell_{i,t''},h_{i,t}}$ give, respectively, the probability of the next sequence $h_{i,t'}$ of length $t'$ and $\ell_{i,t''}$ of length $t''$, given the current sequence $h_{i,t}$.
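The sequence transition functions above are simply ratios of sequential occupancy measures of a longer concatenated sequence to a shorter prefix. A minimal sketch, encoding sequences as tuples so that $\oplus$ is tuple concatenation, with the occupancy measure given as a hypothetical dict `lam`:

```python
def sequence_transition(lam, h, h_next):
    """Ratio form of the sequence transition T_h(h_next | h):
        lambda(h ⊕ h_next) / lambda(h),
    where lam maps tuple-encoded sequences to their sequential occupancy
    measure. Returns 0.0 when the conditioning sequence has no mass."""
    num = lam.get(h + h_next, 0.0)
    den = lam.get(h, 0.0)
    return num / den if den > 0 else 0.0
```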
The sequence transition function $T_{\ell'_{i,t'},\ell_{i,t}}$ associated with $\bar\lambda^{\alpha,\beta}_{\pi,[i],t}$ can be defined in the same way, for any two sequences $\ell'_{i,t'}\in H^{\setminus a}_{i,t'}$ and $\ell_{i,t}\in H^{\setminus a}_{i,t}$, $t,t'\geq 0$.

Given any $h_t=\{h_{i,t}\}_{i\in\mathcal{N}}\in H_t$, $g_t\subset\ell_t\subset h_t$, for any $t\geq 0$, we define the following value functions, $Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ and $V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$, with a slight abuse of notation: for any $t,t',t''\geq 0$ and all $i\in\mathcal{N}$,

$$Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}(h_{i,t}|\theta_i) \equiv R^{(t)}_i(h_{i,t}|\theta_i) + \sum_{\ell'_{i,t'}}T^{\pi,\beta,\alpha}_{\ell'_{i,t'},h_{i,t}}\,V^{\pi,\beta,\alpha;\mu_i}_{i|t',t''}(\ell'_{i,t'}|\theta_i), \tag{40}$$

$$V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}(\ell_{i,t}|\theta_i) \equiv E^{\mu_i}_\pi\Big[Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}(h_{i,t}[a_t]|\theta_i)\,\Big|\,\ell_{i,t}\subset h_{i,t}\Big]. \tag{41}$$

We refer to (40) and (41) as the extended Bellman equations. The following lemma shows an asymptotic relationship between the regular and the sequential occupancy measures.

Lemma 4.
Given any $\langle\beta,\pi\rangle$ and $\alpha$, the following holds for all $i\in\mathcal{N}$:

$$\lim_{t-t'\to\infty}\sum_{h_{i,t}}Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}(h_{i,t}|\theta_i)\,\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(h_{i,t}|\theta_i) = \lim_{t-t'\to\infty}\sum_{\ell_{i,t}}V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}(\ell_{i,t}|\theta_i)\,\bar\lambda^{\alpha,\beta;\mu_i}_{[i],\pi,t}(\ell_{i,t}|\theta_i) = \sum_{g,a,\omega_i,\omega^k_i}R_i(g,a,\omega_i|\theta_i)\,\rho^{\alpha,\beta;\mu_i}_{[i],\pi}(g,a,\omega_i,\omega^k_i|\theta_i). \tag{42}$$

The following proposition re-writes the constraints (27) and (28) of Theorem 3 in terms of $Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ and $V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$.

Proposition 4.1.
Fix a $\kappa$. Let $h_{i,t} \equiv \{g_0,a_0,\omega_{i,0},\omega^k_{i,0}\}\oplus h_{i,1:t}\in H_{i,t}$ and $\ell_{i,t}\subset h_{i,t}$, for any $t\geq 0$. The signaling rule $\alpha^{OIL}$ is OIL-P by $\langle\beta^O,\pi^{AO}\rangle$ if and only if $\pi^{AO}$ is admissible and, for any $\hat a^{(t)}_i\in\mathcal{A}^t$, $\hat\omega^{(t)}_i\in\Omega^t$, $t,t'\geq 0$,

$$V^{\pi^{AO},\beta^O,\alpha^{OIL}}_i(g_0,\omega^k_{i,0}|\theta_i) \geq E^{\mu_i}_{\pi^{AO}_{-i}}\Big[Q^{\pi_i,\pi^{AO}_{-i},\beta^O,\alpha;\mu_i}_{i|t,t'}(h_{i,t}[\hat a^{(t)}_i,a^t_{-i}]|\theta_i)\Big], \tag{43}$$

$$J^{\pi^{AO},\beta^O,\alpha^{OIL}}_i(g_0|\theta_i) \geq E^{\alpha^{OIL}}\Big[V^{\pi^{AO},\beta_i,\beta^O_{-i},\alpha;\mu_i}_{i|t,t'}(\ell_{i,t}[\hat\omega^{(t)}_i,\omega^{k;(t)}_i]|\theta_i)\Big], \tag{44}$$

where $V^{\pi^{AO},\beta^O,\alpha^{OIL}}_i$ and $J^{\pi^{AO},\beta^O,\alpha^{OIL}}_i$ satisfy (19)-(21) in Proposition 1.1, and $Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ and $V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ are given in (40) and (41), respectively.
Given the principal's goal $\kappa$, the restrictions imposed by $\alpha^{OIL}$ on $\beta$ and $\pi$ through (27)-(28) are equivalent to (43)-(44) in Proposition 4.1, respectively. Theorem 3 shows that an OIL-P signaling rule $\alpha^{OIL}$ leads to an O-PBME that solves the constrained problem with $Z^{OIL\text{-}P}(\pi^{AO},\beta^O,V^{\alpha^{OIL}};\alpha^{OIL},\kappa)=0$. We introduce the slack variables $\delta^{\beta^*,\alpha}_{\pi^*;t,t'}[\pi_i]\geq 0$ and $\zeta^{\alpha^{OIL}}_{t,t'}[\beta_i]\geq 0$ to turn the inequality constraints (43) and (44) into equalities, for each deviation $\pi_i$ and $\beta_i$, respectively. Let $L^{\beta^*,\alpha}_{\pi^*}(\Lambda_{t,t'},\Xi_{t,t'},V^*;\delta^{\alpha^{OIL}}_{t,t'},\zeta^{\alpha^{OIL}}_{t,t'}|\theta)$ denote the Lagrangian of the problem (26) with constraints (43), (44), and (29), where $\Lambda_{t,t'}\equiv\{\Lambda^{\pi_i}_{i;t,t'}\}_{i\in\mathcal{N},\pi_i}$ and $\Xi_{t,t'}\equiv\{\Xi^{\beta_i}_{i;t,t'}\}_{i\in\mathcal{N},\beta_i}$ are the dual variables associated with the constraints (43) and (44), for all possible deviations $\pi=\{\pi_i\}_{i\in\mathcal{N}}$ and $\beta=\{\beta_i\}_{i\in\mathcal{N}}$, respectively; and $\delta^{\alpha^{OIL}}_{t,t'}\equiv\{\delta^{\alpha^{OIL}}_{t,t'}[\pi_i]\}_{\pi_i,i\in\mathcal{N}}$ and $\zeta^{\alpha^{OIL}}_{t,t'}\equiv\{\zeta^{\alpha^{OIL}}_{t,t'}[\beta_i]\}_{\beta_i,i\in\mathcal{N}}$. Hence, the Lagrangian of the problem in Theorem 3 at an admissible O-PBME $\langle\beta^O,\pi^{AO}\rangle$ takes the following form: for any $t,t'\geq 0$,

$$L^{\beta^O,\alpha}_{\pi^{AO}}(\Lambda_{t,t'},\Xi_{t,t'},V^{\pi^{AO},\beta^O,\alpha};\delta^{\alpha}_{t,t'},\zeta^{\alpha}_{t,t'}|\theta) \equiv \sum_{i\in\mathcal{N},\pi_i}\Lambda^{\pi_i}_{i;t,t'}\Big(\delta^{\alpha^{OIL}}_{t,t'}[\pi_i] - V^{\pi^{AO},\beta^O,\alpha}_i(g_0,\omega^k_{i,0}|\theta_i) + E^{\mu_i}_{\pi^{AO}_{-i}}\big[Q^{\pi_i,\pi^{AO}_{-i},\beta^O,\alpha;\mu_i}_{i|t,t'}(h_{i,t}[\hat a^{(t)}_i,a^t_{-i}]|\theta_i)\big]\Big) + \sum_{i\in\mathcal{N},\beta_i}\Xi^{\beta_i}_{i;t,t'}\Big(\zeta^{\alpha^{OIL}}_{t,t'}[\beta_i] - J^{\pi^{AO},\beta^O,\alpha}_i(g_0|\theta_i) + E^{\alpha^{OIL}}\big[V^{\pi^{AO},\beta_i,\beta^O_{-i},\alpha;\mu_i}_{i|t,t'}(\ell_{i,t}[\hat\omega^{(t)}_i,\omega^{k;(t)}_i]|\theta_i)\big]\Big). \tag{45}$$

By the Lagrangian sufficiency theorem (see, e.g., [52]), one way to design an OIL-P signaling rule $\alpha$ is to ensure that there exists a pair $\Lambda_{t,t'}$ and $\Xi_{t,t'}$ such that

$$\min_{\pi^*,\beta^*,V^*}L^{\beta^*,\alpha}_{\pi^*}(\Lambda_{t,t'},\Xi_{t,t'},V^*;\delta^{\alpha}_{t,t'},\zeta^{\alpha}_{t,t'}|\theta) = L^{\beta^O,\alpha}_{\pi^{AO}}(\Lambda_{t,t'},\Xi_{t,t'},V^{\pi^{AO},\beta^O,\alpha};\delta^{\alpha}_{t,t'},\zeta^{\alpha}_{t,t'}|\theta).$$
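At their core, the slack variables measure value gaps: the difference between the equilibrium value and the value of the best candidate deviation, so that nonnegative slacks are equivalent to the incentive constraints (43)-(44). A minimal sketch with the values supplied as plain numbers (all names hypothetical):

```python
def incentive_slacks(v_eq, j_eq, q_dev, v_dev):
    """Value-gap form of the slack variables: for each candidate policy
    deviation (q_dev) and selection-rule deviation (v_dev), the gap between
    the equilibrium values (v_eq for (43), j_eq for (44)) and the deviation
    value. All slacks >= 0 iff the incentive constraints hold."""
    delta = {name: v_eq - q for name, q in q_dev.items()}  # policy deviations
    zeta = {name: j_eq - v for name, v in v_dev.items()}   # selection deviations
    return delta, zeta
```

A zero slack identifies a binding constraint, which is exactly the case the Lagrangian multipliers act on.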
The Lagrangian sufficiency theorem states that, with such an $\alpha$, the admissible O-PBME is also an optimal solution of the problem in Theorem 3. From the definition of an admissible O-PBME, it is straightforward to see that $L^{\beta^O,\alpha}_{\pi^{AO}}(\Lambda_{t,t'},\Xi_{t,t'},V^{\pi^{AO},\beta^O,\alpha};\delta^{\alpha}_{t,t'},\zeta^{\alpha}_{t,t'}|\theta)=0$; i.e., the slack variables can be written as, for any $t,t'\geq 0$,

$$\delta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\pi_i] = V^{\pi^{AO},\beta^O,\alpha}_i(g_0,\omega^k_{i,0}|\theta_i) - E^{\mu_i}_{\pi^{AO}_{-i}}\Big[Q^{\pi_i,\pi^{AO}_{-i},\beta^O,\alpha;\mu_i}_{i|t,t'}(h_{i,t}[\hat a^{(t)}_i,a^t_{-i}]|\theta_i)\Big]\,\Big|\,\Lambda_{t,t'},\Xi_{t,t'}, \tag{46}$$

$$\zeta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\beta_i] = J^{\pi^{AO},\beta^O,\alpha}_i(g_0|\theta_i) - E^{\alpha^{OIL}}\Big[V^{\pi^{AO},\beta_i,\beta^O_{-i},\alpha;\mu_i}_{i|t,t'}(\ell_{i,t}[\hat\omega^{(t)}_i,\omega^{k;(t)}_i]|\theta_i)\Big]\,\Big|\,\Lambda_{t,t'},\Xi_{t,t'}. \tag{47}$$

Here, we add the profile $\langle\beta^O,\pi^{AO}\rangle$ to the superscript and subscript of the slack variables to show their dependence on $\langle\beta^O,\pi^{AO}\rangle$. Hence, $\langle\beta^O,\pi^{AO},V^{\pi^{AO},\beta^O,\alpha}\rangle$, together with $\delta^\alpha$ in (46) and $\zeta^\alpha$ in (47), make the constraints (43) and (44) binding.

Define, for any $\beta$, $\pi$, $g\in\mathcal{G}$, $a\in\mathcal{A}$, $\omega\in\Omega^n$, $\omega^k\in\Omega^n$, $\theta\in\Theta$,

$$U^{\beta}_\pi(g,a,\omega|\omega^k,\theta) \equiv \sum_{i\in\mathcal{N}}R_i(g,a,\omega_i|\theta_i)\,\rho^{\beta}_\pi(g,a,\omega|\omega^k,\theta), \tag{48}$$

with $U^{\beta}_\pi(g,a,\omega^k|\theta) = U^{\beta}_\pi(g,a,\omega^k|\omega^k,\theta)$, where $\rho^{\beta}_\pi(\cdot|\omega^k,\theta)$ is given in (31).

Based on the result of Lemma 4, we have the following proposition.

Proposition 4.2.
Let $\Lambda^*_{t,t'}$ and $\Xi^*_{t,t'}$ denote the dual variables such that $L^{\beta^O,\alpha}_{\pi^{AO}}(\Lambda_{t,t'},\Xi_{t,t'},V^{\pi^{AO},\beta^O,\alpha};\delta^{\beta^O,\alpha}_{\pi^{AO};t,t'},\zeta^{\beta^O,\alpha}_{\pi^{AO};t,t'}|\theta)=0$ for the $\delta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\pi_i]$ and $\zeta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\beta_i]$ given in (46) and (47). The following holds: for any $\pi_i$, $\beta_i$, $i\in\mathcal{N}$,

$$\lim_{t-t'\to\infty}\delta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\pi_i] = \sum_{g,a,\omega^k,i\in\mathcal{N}}\alpha(\omega^k|g,\theta)\,U^{\beta^O}_{\pi^{AO}}(g,a,\omega^k|\theta) - \sum_{g,a,\omega^k,i\in\mathcal{N}}\alpha(\omega^k|g,\theta)\,U^{\beta^O}_{\pi_i,\pi^{AO}_{-i}}(g,a_i,a_{-i},\omega^k|\theta), \tag{49}$$
$$\lim_{t-t'\to\infty}\zeta^{\beta^O,\alpha}_{\pi^{AO};t,t'}[\beta_i] = \sum_{g,a,\omega^k,i\in\mathcal{N}}\alpha(\omega^k|g,\theta)\,U^{\beta^O}_{\pi^{AO}}(g,a,\omega^k|\theta) - \sum_{g,a,\omega^k,i\in\mathcal{N}}\alpha(\omega^k|g,\theta)\,U^{\beta_i,\beta^O_{-i}}_{\pi^{AO}}(g,a,\beta_i(g,\omega^k_i,\theta_i),\omega^k_{-i}|\omega^k,\theta), \tag{50}$$

and the corresponding dual variables are $\lim_{t-t'\to\infty}\Lambda^{\pi_i;*}_{i;t,t'} = \lambda^{\alpha,\beta^O;\mu_i}_{[i],(\pi_i,\pi^{AO}_{-i}),\infty}(\cdot|\theta_i)$ and $\lim_{t-t'\to\infty}\Xi^{\beta_i;*}_{i;t,t'} = \bar\lambda^{\alpha,(\beta_i,\beta^O_{-i});\mu_i}_{[i],\pi^{AO},\infty}(\cdot|\theta_i)$.

Proposition 4.2 describes an asymptotic situation in which the length of the sequence $h_{i,t}$ of $Q^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ is much larger than the length of the sequence $\ell'_{i,t'}$ of $V^{\pi,\beta,\alpha;\mu_i}_{i|t,t'}$ in the extended Bellman equations (40)-(41). We drop the subscripts $t$ and $t'$ from the notation of the slack and dual variables when $t-t'\to\infty$. This asymptotic result motivates a design regime for the signaling rule:

Theorem 5.
Define a set of signaling rules, for any $\kappa$,
\[
A^{O}[\kappa] \equiv \Big\{ \alpha^{*} : \forall \pi^{*} \in \Pi^{O},\ \alpha^{*} \in \arg\max_{\alpha} \min_{\pi, \beta} \sum_{\pi, \beta} \big( \delta^{\beta^{O},\alpha}_{\pi^{*}}[\pi] + \zeta^{\beta^{O},\alpha}_{\pi^{*}}[\beta] \big) \ \text{and}\ \sum_{\omega^{k}} \rho^{\beta^{O}}_{\pi^{*}}(g, a, \omega^{k} \,|\, \theta)\, \alpha^{*}(\omega^{k} \,|\, g, \theta) = \rho_{\kappa}(g, a \,|\, \theta) \Big\}. \quad (51)
\]
Given a goal $\kappa$, a signaling rule $\alpha^{*} \in A^{O}[\kappa]$ that induces $<\beta^{O}, \pi^{*}>$ is OIL-P if $L^{\beta^{O},\alpha^{*}}_{\pi^{*}}(\Lambda, \Xi, V^{\pi^{*},\beta^{O},\alpha^{*}}; \delta^{\beta^{O},\alpha^{*}}_{\pi^{*}}, \zeta^{\beta^{O},\alpha^{*}}_{\pi^{*}} \,|\, \theta) = 0$.

In Theorem 5, the set $A^{O}[\kappa]$ in (51) characterizes a design regime for determining signaling rules that realize the principal's goal $\kappa$ while giving each agent $i$ an incentive to use the obedient selection rule for every possible observation $o_i$. The $\alpha^{*}$ maximizing the minimum of the slack variables (46) and (47) makes any of its induced $<\beta^{O}, \pi^{AO}>$ a feasible solution of the problem (26)-(29). From the Lagrangian sufficiency theorem, the condition $L^{\beta^{*},\alpha^{*}}_{\pi^{*}}(\Lambda_{t,t'}, \Xi_{t,t'}, V^{\pi^{*},\beta^{O},\alpha^{*}}; \delta^{\alpha^{*}}, \zeta^{\alpha^{*}} \,|\, \theta) = 0$ yields that $<\beta^{O}, \pi^{*}>$ and the corresponding $V^{\pi^{*},\beta^{O},\alpha^{*}}$ (satisfying (19)-(21)) form a solution of (26)-(29) with $Z^{\text{OIL-P}}(\pi^{AO}, \beta^{O}, V^{\alpha^{OIL}}; \alpha^{OIL}, \kappa) = 0$.

In this section, we introduce the optimality criterion of the principal's goal $\kappa$ and define the optimal information design problem for the principal in a Markov game $M[\alpha]$. We define the one-stage payoff function of the principal as $u_k : A \times G \times \Theta \to \mathbb{R}$, such that $u_k(a, g; \theta)$ gives the immediate payoff of the principal when the state is $g$ and the agents with type $\theta$ take actions $a$. The principal's goal $\kappa$ is the probability distribution of the agents' joint actions in equilibrium, conditioned only on the global state and the agents' types. Hence, the information structure that matters for the principal's goal-selection problem is given as $O_k \equiv <G, \Theta, T_g, d_g, d_\theta>$.
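Before turning to the goal-selection problem, the max-min construction behind the design regime in (51) can be illustrated with a small numerical sketch. Everything below is hypothetical: the dimensions, the payoff tables standing in for the aggregated payoffs $U^{\beta}_{\pi}$ of (48), and the finite deviation set standing in for the slack variables (46)-(47); the admissibility (consistency) constraint of (51) is omitted, and the $\arg\max_{\alpha}\min$ is approximated by brute-force sampling rather than an exact optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): states g, joint actions a, signals w.
n_g, n_a, n_w = 2, 3, 2

# Hypothetical stand-ins for the aggregated payoffs U in (48): one table for
# the obedient profile <beta^O, pi^AO>, and one table per unilateral deviation.
U_obedient = rng.normal(size=(n_g, n_a, n_w))
U_deviations = rng.normal(size=(5, n_g, n_a, n_w))  # 5 candidate deviations

def min_slack(alpha, U_ob, U_dev):
    """Smallest slack over deviations: sum_{g,a,w} alpha(w|g) * (U_ob - U_dev)."""
    diff = U_ob[None, ...] - U_dev               # shape (D, g, a, w)
    slacks = np.einsum('gw,dgaw->d', alpha, diff)
    return slacks.min()

# Brute-force stand-in for argmax_alpha min_{pi,beta} of the slack sum in (51):
# sample stochastic signaling rules alpha(w|g) and keep the max-min one.
best_alpha, best_val = None, -np.inf
for _ in range(2000):
    alpha = rng.dirichlet(np.ones(n_w), size=n_g)  # each row sums to 1
    val = min_slack(alpha, U_obedient, U_deviations)
    if val > best_val:
        best_alpha, best_val = alpha, val

print("max-min slack:", best_val)
```

In this toy form the slack of each deviation is linear in $\alpha$, so the min over deviations is concave in $\alpha$ and the sampling loop is only a crude stand-in for a linear-programming solution.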
The principal chooses a goal by maximizing her expected payoff ($\hat{\gamma}$-discounted, $\hat{\gamma} \in (0,1)$), given as:
\[
Z_k(\kappa \,|\, O_k) \equiv \mathbb{E}_{\kappa}\Big[ \sum_{t=0}^{\infty} \hat{\gamma}^{t}\, u_k(a_t, g_t, \theta) \,\Big|\, O_k \Big]. \quad (52)
\]
The principal's goal $\kappa$ is chosen such that her expected payoff (52) is maximized. However, the principal cannot force the agents to take actions or directly program the agents' actions according to $\kappa$; instead, she uses information design to elicit actions from the agents that coincide with her goal $\kappa$. The following theorem establishes an important relationship between the goal $\kappa$ and the agents' equilibrium.

Theorem 6.
The profile $<\beta^{O}, \pi^{AO}>$ is an admissible O-PBME if and only if the goal $\kappa$ is a BMCE (Definition 0.6) and there exists an $\alpha^{*} \in A^{O}[\kappa]$ such that $L^{\beta^{O},\alpha^{*}}_{\pi^{*}}(\Lambda, \Xi, V^{\pi^{*},\beta^{O},\alpha^{*}}; \delta^{\beta^{O},\alpha^{*}}_{\pi^{*}}, \zeta^{\beta^{O},\alpha^{*}}_{\pi^{*}} \,|\, \theta) = 0$.

Theorem 6 strictly generalizes Bergemann and Morris's characterization of Bayes correlated equilibrium for incomplete-information static games (Theorem 1 of [6]) to our Markovian setting, in which agents both select signals and take actions. Essentially, the BMCE characterizes all possible O-PBME that could arise under all signaling rules in $A^{O}[\kappa]$, where $\kappa$ is any BMCE. Hence, the principal's goal-selection problem is a BMCE selection problem.

Suppose the principal's $\alpha$ induces a profile $<\beta, \pi>$. With a slight abuse of notation, we define the principal's expected payoff from the agents' behaviors under $<\beta, \pi>$ as follows:
\[
Z_k(\alpha, \beta, \pi \,|\, O_{\alpha}) \equiv \mathbb{E}\Big[ \sum_{t=0}^{\infty} \hat{\gamma}^{t}\, u_k(a_t, g_t; \theta)\, \pi(a_t \,|\, g_t, \beta(g_t, \omega^{k}_t, \theta), \theta) \,\Big|\, O_{\alpha} \Big], \quad (53)
\]
with $Z_k(\alpha, \pi \,|\, O_{\alpha}) = Z_k(\alpha, \beta^{O}, \pi \,|\, O_{\alpha})$, where the information structure is $O_{\alpha} = <G, \Omega^{m}, \Theta, T_g, P_{-k}, \alpha, d_g, d_\theta>$. We refer to (53) as the principal's transformed problem. Define a set of PBME profiles $\text{PBME}[\alpha]$ when the signaling rule is $\alpha$:
\[
\text{PBME}[\alpha] \equiv \Big\{ <\beta^{*}, \pi^{*}> : \forall \beta_i, \pi_i, i \in N,\ Z_k(\alpha, \beta^{*}, \pi^{*} \,|\, O_{\alpha}) \geq Z_k(\alpha, (\beta_i, \beta^{*}_{-i}), (\pi_i, \pi^{*}_{-i}) \,|\, O_{\alpha}) \Big\}. \quad (54)
\]
We write $\text{PBME}[\alpha, \beta^{*}]$ for the set of PBME policy profiles when the agents use the equilibrium selection-rule profile $\beta^{*}$ (i.e., $\beta^{*}$ appears on both sides of the inequality in (54)), given $\alpha$. We define a set of signaling rules, for any policy profile $\pi^{*}$,
\[
A^{O}[\pi^{*}] \equiv \Big\{ \alpha^{*} : \alpha^{*} \in \arg\max_{\alpha} \min_{\pi, \beta} \sum_{\pi, \beta} \big( \delta^{\beta^{O},\alpha}_{\pi^{*}}[\pi] + \zeta^{\beta^{O},\alpha}_{\pi^{*}}[\beta] \big) \Big\}. \quad (55)
\]
Hence, $A^{O}[\pi^{*}]$ is the set $A^{O}[\kappa]$ in (51) without the admissibility condition. Theorem 6 motivates the following transformation of the principal's BMCE selection problem into an information design problem:
\[
\max_{\kappa \in \text{BMCE}} Z_k(\kappa \,|\, O_k) = \max_{\alpha \in A^{O}[\pi]} \max_{\pi \in \text{PBME}[\alpha, \beta^{O}]} Z_k(\alpha, \pi \,|\, O_{\alpha}). \quad (56)
\]
Suppose $\kappa^{*}$ is an optimal BMCE for the principal, and suppose additionally that $\alpha^{*}$ is a solution to the right-hand side of (56). However, choosing a signaling rule from $A^{O}[\kappa]$ in (51) is in general only a sufficient condition for OIL-P. Hence, for a $<\beta^{*}, \pi^{*}> \in \text{PBME}[\alpha^{*}]$, we may have $L^{\beta^{*},\alpha^{*}}_{\pi^{*}}(\Lambda, \Xi, V^{\pi^{*},\beta^{*},\alpha^{*}}; \delta^{\beta^{*},\alpha^{*}}_{\pi^{*}}, \zeta^{\beta^{*},\alpha^{*}}_{\pi^{*}} \,|\, \theta) \neq 0$. That is, the principal's optimal goal may not be realized by the designed signaling rule $\alpha^{*}$, i.e., $Z_k(\kappa^{*} \,|\, O_k) \geq Z_k(\alpha^{*}, \beta^{*}, \pi^{*} \,|\, O_{\alpha^{*}})$. When the choice of signaling rule characterized by $A^{O}[\kappa]$ in (51) cannot guarantee OIL-P, we consider the notion of $\epsilon$-OIL-P, a relaxation of OIL-P.

Definition 6.1 ($\epsilon$-OIL-P). We say that a signaling rule $\alpha_{\epsilon}$ is $\epsilon$-OIL-P if it induces a profile $<\beta^{*}_{\epsilon}, \pi^{*}_{\epsilon}>$ with $\mu_i$, such that the belief $\mu_i$ is updated according to (14) and, for all $\pi_i$, $\beta_i$, $i \in N$,
\[
Z_k(\alpha_{\epsilon}, \beta^{*}_{\epsilon}, \pi^{*}_{\epsilon}) + \epsilon \geq Z_k(\alpha_{\epsilon}, (\beta_i, \beta^{*}_{-i;\epsilon}), (\pi_i, \pi^{*}_{-i;\epsilon})). \quad (57)
\]
Let $v_k(\cdot\,; \beta, \pi, \alpha) : G \times \Theta \to \mathbb{R}$ denote the state-value function associated with the principal's transformed problem (53). Define, for any $g \in G$, $\theta \in \Theta$,
\[
\Psi(g, \theta; \pi, \beta, \alpha) \equiv v_k(g, \theta; \beta, \pi, \alpha) - \mathbb{E}^{\mu_i}_{\pi}\Big[ \sum_{i} R_i(g, a, \beta_i(g, \omega^{k}, \theta_i), \theta_i) + \gamma \sum_{g'} T_g(g' \,|\, g, a)\, v_k(g', \theta; \beta, \pi, \alpha) \,\Big|\, O_{\alpha} \Big]. \quad (58)
\]
If $\Psi(g, \theta; \pi, \beta, \alpha) = 0$, the state-value function $v_k(\cdot\,; \beta, \pi, \alpha)$ is the optimal state-value function (i.e., a solution to the Bellman optimality equation).

Proposition 6.1.
Suppose the signaling rule $\alpha$ induces $<\beta^{O}_{\epsilon}, \pi^{AO}_{\epsilon}>$ that is feasible with respect to the constraints (27) and (28). Let $\Phi(\theta; \pi, \beta, \alpha) \equiv \mathbb{E}\big[ \Psi(g, \theta; \pi, \beta, \alpha) \,\big|\, O_{\alpha} \big]$. Then, $<\beta^{O}_{\epsilon}, \pi^{AO}_{\epsilon}>$ forms an $\epsilon$-admissible O-PBME, with
\[
\epsilon = \frac{\Phi(\theta; \pi, \beta, \alpha)}{1 - \hat{\gamma}}.
\]
The corresponding signaling rule $\alpha$ is $\epsilon$-OIL-P.

Proposition 6.1 characterizes $\epsilon$-OIL-P as an approximation to the optimal information design given in Theorem 5. Descent algorithms used to solve the information design problem may lead to approximate equilibrium behaviors, and Proposition 6.1 implies that when $\Phi(\theta; \pi, \beta, \alpha)$ is small enough relative to $1 - \hat{\gamma}$, the computational outcome of the algorithm is good enough (i.e., $\epsilon \to 0$).

This work is the first to propose an information design principle for incomplete-information dynamic games in which each agent makes the coupled decisions of selecting a signal and taking an action at each period of time. We have formally defined a novel information design problem in the indirect and the direct settings and have restricted attention to the direct one due to the obedient principle. The notion of obedient implementability has been introduced to capture the optimality of direct information design in the equilibrium concept of obedient perfect Bayesian Markov Nash equilibria (O-PBME). By characterizing obedient implementability, we have proposed an approach to determining the information structure by maximizing the optimal slack variables arising from the optimality of the agents' equilibrium behaviors. Our representation result formulates the principal's optimal Bayesian Markov correlated equilibrium selection in terms of information design implementable in O-PBME.
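As a closing illustration, the Bellman-residual reading of Proposition 6.1 can be sketched numerically: compute the residual $\Psi$ of a candidate state-value function and convert it into the approximation level $\epsilon = \Phi/(1-\hat{\gamma})$. The sketch is illustrative only; the transition matrix, rewards, and candidate value function are hypothetical stand-ins for the objects induced by $<\beta, \pi, \alpha>$, and the mean residual magnitude is used as a conservative stand-in for the expectation $\Phi$.

```python
import numpy as np

rng = np.random.default_rng(1)

gamma_hat = 0.9  # principal's discount factor
n_g = 4          # hypothetical number of global states

# Hypothetical induced dynamics and one-stage rewards under a fixed <beta, pi>.
T = rng.dirichlet(np.ones(n_g), size=n_g)  # T[g, g'] = Pr(g' | g)
r = rng.normal(size=n_g)                   # expected one-stage payoff at g

# A candidate state-value function, e.g. produced by an approximate solver.
v = rng.normal(size=n_g)

# Bellman residual Psi(g) = v(g) - [r(g) + gamma * sum_g' T(g,g') v(g')], cf. (58).
psi = v - (r + gamma_hat * T @ v)

# Phi averages |Psi| over states (conservative stand-in for the expectation),
# and the approximation level is epsilon = Phi / (1 - gamma_hat).
phi = np.abs(psi).mean()
epsilon = phi / (1 - gamma_hat)

# Sanity check: the exact solution of the evaluation equation has zero residual.
v_exact = np.linalg.solve(np.eye(n_g) - gamma_hat * T, r)
psi_exact = v_exact - (r + gamma_hat * T @ v_exact)

print("epsilon:", epsilon)
```

When the residual vanishes (as for `v_exact`), $\epsilon \to 0$ and the candidate policy profile is an exact admissible O-PBME in this toy reading.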
References

[1] Anthony Dickinson. Actions and habits: the development of behavioural autonomy. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 308(1135):67-78, 1985.
[2] Dirk Bergemann and Stephen Morris. Information design: A unified perspective. Journal of Economic Literature, 57(1):44-95, 2019.
[3] Ina Taneva. Information design. American Economic Journal: Microeconomics, 11(4):151-85, 2019.
[4] Nuttapong Chentanez, Andrew G. Barto, and Satinder P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281-1288, 2005.
[5] Laurent Mathevet, Jacopo Perego, and Ina Taneva. On information design in games. Journal of Political Economy, 128(4):1370-1404, 2020.
[6] Dirk Bergemann and Stephen Morris. Bayes correlated equilibrium and the comparison of information structures in games. Theoretical Economics, 11(2):487-522, 2016.
[7] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6):2590-2615, 2011.
[8] Jeffrey Ely, Alexander Frankel, and Emir Kamenica. Suspense and surprise. Journal of Political Economy, 123(1):215-260, 2015.
[9] Juan Passadore and Juan Pablo Xandri. Robust conditional predictions in dynamic games: An application to sovereign debt. Job Market Paper, 2015.
[10] Laura Doval and Jeffrey C. Ely. Sequential information design. Econometrica, 88(6):2575-2608, 2020.
[11] Jeffrey C. Ely. Beeps. American Economic Review, 107(1):31-53, 2017.
[12] Jeffrey C. Ely and Martin Szydlowski. Moving the goalposts. Journal of Political Economy, 128(2):468-506, 2020.
[13] Miltiadis Makris and Ludovic Renou. Information design in multi-stage games. Technical report, working paper, 2018.
[14] Frédéric Koessler, Marie Laclau, and Tristan Tomala. Interactive information design. HEC Paris Research Paper No. ECO/SCD-2018-1260, 2018.
[15] Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58-73, 1981.
[16] Alessandro Pavan, Ilya Segal, and Juuso Toikka. Dynamic mechanism design: A myersonian approach. Econometrica, 82(2):601-653, 2014.
[17] Tao Zhang and Quanyan Zhu. On incentive compatibility in dynamic mechanism design with exit option in a markovian environment, 2019.
[18] Tao Zhang and Quanyan Zhu. On the differential private data market: Endogenous evolution, dynamic pricing, and incentive compatibility, 2021.
[19] Paul Milgrom. Putting Auction Theory to Work. Cambridge University Press, 2004.
[20] Satyanath Bhat, Shweta Jain, Sujit Gujar, and Yadati Narahari. An optimal bidimensional multi-armed bandit auction for multi-unit procurement. Annals of Mathematics and Artificial Intelligence, 85(1):1-19, 2019.
[21] Tayfun Sönmez and M. Utku Ünver. Matching, allocation, and exchange of discrete resources. In Handbook of Social Economics, volume 1, pages 781-852. Elsevier, 2011.
[22] Tao Zhang and Quanyan Zhu. Optimal two-sided market mechanism design for large-scale data sharing and trading in massive iot networks. arXiv preprint arXiv:1912.06229, 2019.
[23] Daniel Dewey. Reinforcement learning and the reward engineering principle. In , 2014.
[24] Raghav Nagpal, Achyuthan Unni Krishnan, and Hanshen Yu. Reward engineering for object pick and place training. arXiv preprint arXiv:2001.03792, 2020.
[25] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765-6774, 2017.
[26] Emir Kamenica. Bayesian persuasion and information design. Annual Review of Economics, 11:249-272, 2019.
[27] Isabelle Brocas and Juan D. Carrillo. Influence through ignorance. The RAND Journal of Economics, 38(4):931-947, 2007.
[28] Luis Rayo and Ilya Segal. Optimal information disclosure. Journal of Political Economy, 118(5):949-987, 2010.
[29] Itai Arieli and Yakov Babichenko. Private bayesian persuasion. Journal of Economic Theory, 182:185-217, 2019.
[30] Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in Neural Information Processing Systems, 33, 2020.
[31] Jean-Francois Mertens and Shmuel Zamir. Formulation of bayesian analysis for games with incomplete information. International Journal of Game Theory, 14(1):1-29, 1985.
[32] Itay Goldstein and Yaron Leitner. Stress tests and information disclosure. Journal of Economic Theory, 177:34-69, 2018.
[33] Nicolas Inostroza and Alessandro Pavan. Persuasion in global games with application to stress testing, 2018.
[34] Penélope Hernández and Zvika Neeman. How bayesian persuasion can help reduce illegal parking and other socially undesirable behavior. Preprint, 2018.
[35] Zinovi Rabinovich, Albert Xin Jiang, Manish Jain, and Haifeng Xu. Information disclosure as a means to security. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 645-653, 2015.
[36] Scott Gehlbach and Konstantin Sonin. Government control of the media. Journal of Public Economics, 118:163-171, 2014.
[37] Sanmay Das, Emir Kamenica, and Renee Mirka. Reducing congestion through information design. In , pages 1279-1284. IEEE, 2017.
[38] Darrell Duffie, Piotr Dworczak, and Haoxiang Zhu. Benchmarks in search markets. The Journal of Finance, 72(5):1983-2044, 2017.
[39] Martin Szydlowski. Optimal financing and disclosure. Management Science, 67(1):436-454, 2021.
[40] Daniel Garcia and Matan Tsur. Information design in competitive insurance markets. Journal of Economic Theory, 191:105160, 2021.
[41] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Maximum causal entropy correlated equilibria for markov games. In AAMAS, pages 207-214, 2011.
[42] John C. Harsanyi. Games with incomplete information played by "bayesian" players, I-III: Part I. The basic model. Management Science, 14(3):159-182, 1967.
[43] Martin J. Osborne et al. An Introduction to Game Theory, volume 3.
[44] Atsushi Kajii and Stephen Morris. The robustness of equilibria to incomplete information. Econometrica: Journal of the Econometric Society, pages 1283-1309, 1997.
[45] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Annual International Cryptology Conference, pages 112-130. Springer, 2000.
[46] Onésimo Hernández-Lerma and Jean B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media, 2012.
[47] Richard Bellman. Dynamic programming. Science, 153(3731):34-37, 1966.
[48] Jerzy Filar and Koos Vrieze. Competitive Markov Decision Processes: Theory, Algorithms, and Applications, 1997.
[49] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[50] Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032-1039, 2008.
[51] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4572-4580, 2016.
[52] Stephen Boyd and Lieven Vandenberghe.