Mechanisms for a No-Regret Agent: Beyond the Common Prior
Modibo Camara†  Jason Hartline‡  Aleck Johnsen§
September 14, 2020
Abstract
A rich class of mechanism design problems can be understood as incomplete-information games between a principal who commits to a policy and an agent who responds, with payoffs determined by an unknown state of the world. Traditionally, these models require strong and often-impractical assumptions about beliefs (a common prior over the state). In this paper, we dispense with the common prior. Instead, we consider a repeated interaction where both the principal and the agent may learn over time from the state history. We reformulate mechanism design as a reinforcement learning problem and develop mechanisms that attain natural benchmarks without any assumptions on the state-generating process. Our results make use of novel behavioral assumptions for the agent – centered around counterfactual internal regret – that capture the spirit of rationality without relying on beliefs.

∗ This work began as part of the 2018 Special Quarter on Data Science and Online Markets at Northwestern. We are especially grateful to Simina Brânzei and Katya Khmelnitskaya for their early contributions to this project. We are also grateful to Eddie Dekel, Marciano Siniscalchi, and several anonymous referees for helpful comments, in addition to audiences at the 70th Midwest Theory Day and Northwestern. Jason Hartline and Aleck Johnsen were supported in part by NSF grant CCF-1618502.
† Department of Economics, Northwestern University. Email: [email protected].
‡ Department of Computer Science, Northwestern University. Email: [email protected].
§ Department of Computer Science, Northwestern University. Email: [email protected].
Introduction
Mechanism design is a branch of economic theory concerned with the design of social institutions. It encompasses a wide range of phenomena that have historically been of interest to economists, including, but not limited to, auctions (Myerson 1981; Vickrey 1961), matching markets (Gale and Shapley 1962; Roth 1982), taxation (Mirrlees 1971), contracts (Ross 1973; Spence and Zeckhauser 1971), and persuasion (Kamenica and Gentzkow 2011).

Despite this field's potential, it is often unclear whether and how mechanisms derived from economic theory can be implemented in practice. In particular, one modeling practice stands out as a barrier to implementation: the common prior assumption. Many mechanism design problems are only interesting in the presence of uncertainty, and this uncertainty is typically modeled as stochasticity. The state of the world is drawn according to some distribution and, importantly, the distribution is commonly known by the designer and all participants in the mechanism. The common prior assumption is limiting in two ways. First, mechanisms based on a common prior may not be practicable, because they rely on knowledge that a real-world designer is unlikely to possess. Second, even if the designer knows the distribution (resp. has beliefs), the participants may not arrive with the same knowledge (resp. share those beliefs).

This paper will dispense with the common prior assumption. In its place, we consider a model of adversarial online learning where the principal and a single agent are learning about the state, over time, using data. The static mechanism design problem is a Stackelberg game of incomplete information. The principal chooses a policy 𝑝, the agent chooses a response 𝑟, nature chooses a state 𝑦, and payoffs are realized. In the online problem, this game is repeated 𝑇 times, where state 𝑦_𝑡 is revealed at the end of period 𝑡. The sequence of states is arbitrary and the principal's mechanism should perform well without prior knowledge of the sequence. The principal's present choices can affect the agent's future behavior; this makes mechanism design a reinforcement learning problem in our model.

In the absence of distributional assumptions, standard restrictions on the agent's behavior, like Bayesian rationality, become toothless. In its place, we define counterfactual internal regret (CIR) and assume that the agent obtains low CIR. This is an ex post definition of rationality that includes Bayesian rationality (with a well-calibrated prior) as a special case. We develop data-driven mechanisms that are guaranteed to perform well under our behavioral assumptions. Specifically, we prove upper bounds on the principal's regret from following our mechanism, relative to the single fixed policy that performs best in hindsight. Our results take the form of reductions from the principal's problem to robust versions of static mechanism design with a common prior.

Running Example.
Bayesian persuasion is a model of strategic communication, due to Kamenica and Gentzkow (2011). It has received considerable attention from economists and, more recently, algorithmic game theorists (e.g. Dughmi and Xu 2016, Cummings et al. 2020). It is a useful test case for our framework because (a) it is interesting even with only one agent, (b) the optimal solution varies with the agent's beliefs, and (c) it has the potential to be widely applicable. Indeed, Bayesian persuasion has been used to study a wide range of topics, including recommendation systems (Mansour et al. 2016), traffic congestion (Das et al. 2017), congested social services (Anunrojwong et al. 2020), financial stress-testing (Goldstein and Leitner 2018), and worker motivation (Ely and Szydlowski 2020).

As a running example, consider a pharmaceutical company (the principal) that wants a regulator (the agent) to approve its new drug. A state 𝑦 ∈ {High, Low} describes the drug's quality. Neither the regulator nor the company knows the quality in advance. The company needs to design a clinical trial that will generate (possibly noisy) information about the drug's quality. Roughly, a trial 𝑝 specifies the probability 𝑝(𝑚, 𝑦) of sending a message 𝑚 to the regulator, conditional on the drug quality 𝑦. Informally, the message describes the outcome of the trial. After hearing the message, the regulator decides whether to approve the drug. The regulator receives a payoff if it approves a high-quality drug or rejects a low-quality drug. The company receives a payoff if the regulator approves, regardless of quality. Its challenge is to design a clinical trial that convinces the regulator to approve as many drugs as possible.

To predict behavior in incomplete-information games, we need to make assumptions about how the agents deal with uncertainty. The common prior is one such assumption. In our running example, the common prior would specify a probability 𝑞 ∈ [0, 1] that the drug is high quality. Consider the case 𝑞 = 1∕3. If the company does not run a trial – e.g. it recommends "approve" in every state – the regulator would never approve, as the drug is more likely to be low quality than high quality ex ante. If the company runs the most thorough trial possible – e.g. it recommends "approve" if and only if the drug is high quality – the regulator would approve with probability 1∕3. Finally, consider the optimal trial. The optimal trial always recommends "approve" if the drug is high quality. If the drug is low quality, it recommends "approve" and "reject" with equal probability. After hearing "approve", the regulator's posterior puts equal weight on both states, and so it might as well approve. Here, the regulator approves with probability 2∕3.
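To make the arithmetic in this example easy to check, here is a minimal numerical sketch (ours, not from the paper). It encodes a trial as the probability of recommending "approve" in each state, assumes the payoffs described above (so the regulator approves exactly when its posterior probability of High is at least 1∕2), and reproduces the 0, 1∕3, and 2∕3 approval probabilities. All function names are ours.

```python
import numpy as np

# States: index 0 = Low quality, 1 = High quality.  Prior q = P(High).
q = 1 / 3

def approval_prob(trial, q):
    """Probability the regulator approves, given a trial and prior q.

    trial[y] is the probability of sending the message "approve" when the
    drug quality is y (0 = Low, 1 = High).  The regulator approves after a
    message iff the posterior probability of High is at least 1/2 (ties
    broken in the company's favor)."""
    prior = np.array([1 - q, q])                      # [P(Low), P(High)]
    total = 0.0
    for msg_prob in (trial, 1 - trial):               # "approve", then "reject"
        p_msg = prior @ msg_prob                      # marginal prob. of message
        if p_msg == 0:
            continue
        posterior_high = prior[1] * msg_prob[1] / p_msg
        if posterior_high >= 0.5:                     # regulator approves
            total += p_msg
    return total

no_trial   = np.array([1.0, 1.0])   # always recommend "approve"
full_trial = np.array([0.0, 1.0])   # reveal quality exactly
optimal    = np.array([0.5, 1.0])   # recommend "approve" w.p. 1/2 when Low

print(approval_prob(no_trial, q))    # 0.0 -> regulator never approves
print(approval_prob(full_trial, q))  # 1/3 -> approves only High drugs
print(approval_prob(optimal, q))     # 2/3 -> the optimal trial
```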
Online Mechanism Design. In our model, both the company and the regulator would be learning about drug quality over time. New drugs arrive sequentially. For each drug, the company designs a clinical trial and generates a message. The regulator hears the message and decides whether to approve. Regardless of whether the drug is approved, both parties eventually learn the drug's true quality, and the next drug arrives. The company's strategy, called a mechanism, maps the drug (i.e. state) history and the approval decision (i.e. response) history to a trial for the current drug. The regulator's strategy, called a learning algorithm or learner, maps the drug quality history and the trial (i.e. policy) history to an approval decision for the current drug. This model is online because the company and regulator must make decisions while the drugs are still arriving. It is adversarial in the sense that we impose no assumptions on the sequence of drugs, and so any results (e.g. claiming that a mechanism performs well) must hold for all such sequences.

The company's problem is to develop a mechanism that performs as well as the best-in-hindsight trial. That is, the company should not regret following its mechanism relative to any simple alternative where it picks the same trial 𝑝 in every period. To evaluate what would have happened under an alternative sequence of trials, the company must take into account how the regulator's behavior would have changed. Therefore, the company faces a reinforcement learning problem, and its benchmark corresponds to the notion of policy regret in the literature on bandit learning with adaptive adversaries. In that literature, vanishing policy regret is unattainable against an unrestricted adaptive adversary. This fact precludes a simple solution to the company's problem; we must constrain the regulator's behavior.

The standard way to constrain the regulator/agent's behavior – i.e. to capture "self-interest" in the absence of a meaningful notion of ex ante optimality – is to impose upper bounds on the agent's regret. This will be our approach as well. We build on existing no-regret assumptions, in ways that are intended to refine and better motivate those assumptions.
No-Regret Agents.
Two notions of regret have been used historically: external and internal (or swap) regret (ER and IR). For example, Nekipelov et al. (2015) show how ER bounds combined with bidding data can be used to partially identify bidder valuations in a dynamic auction. Braverman et al. (2018) consider a dynamic pricing problem against no-ER agents. Their analysis is generalized by Deng et al. (2019), who study repeated Stackelberg games of complete information. Furthermore, the literature on no-regret learning in games has established that if agents satisfy a no-ER (resp. no-IR) property in a repeated game, the empirical distribution of their actions will converge to a coarse correlated equilibrium (resp. correlated equilibrium) (Blum, Hajiaghayi, et al. 2008; Foster and Vohra 1997; Hart and Mas-Colell 2001; Hartline, Syrgkanis, et al. 2015).

Both ER and IR can be thought of as "non-policy" regret, because they do not take into account how the agent's behavior affects the behavior of others. The justification for these regret bounds is that (a) they are satisfied by well-known learning algorithms (see e.g. Littlestone and Warmuth 1994 for ER), and (b) they generalize optimality conditions associated with a stationary equilibrium. Nonetheless, these regret bounds can be problematic. Effectively, they assume that agents are (a) sophisticated enough to obtain low non-policy regret, but (b) not aware that their true objective is policy regret. Keep in mind that an agent who minimizes policy regret can easily obtain high non-policy regret, and thereby violate the regret bounds.

To avoid this problem, the principal in our model can commit to a mechanism that is nonresponsive to the agent's behavior: the policy 𝑝_𝑡 depends on the state history but not on the agent's response history. When mechanisms are nonresponsive, non-policy regret and policy regret coincide for the agent. Then, bounds on the agent's regret are permissive assumptions that allow a wide range of sophisticated and self-interested behavior, including Bayesian rationality.

A few remarks on the related literature are worth recording. Arora, Dekel, et al. (2012) obtain positive results when the adversary satisfies a bounded memory assumption. Ryabko and Hutter (2008) obtain positive results under a different kind of assumption, that the environment is sufficiently "forgiving" of mistakes. These papers reflect two prominent approaches in reinforcement learning: (a) restricting attention to Markov decision processes, and (b) assuming an ability to "reset" the problem (Kearns et al. 1999). Arora, Dinitz, et al. (2018) consider policy regret in a repeated game and use the self-interest of the adaptive adversary to motivate behavioral restrictions. This is reminiscent of a literature on multi-agent reinforcement learning when the state is Markovian (Buşoniu et al. 2010; Hu and Wellman 1998; Littman 1994; Uther and Veloso 2003); unlike these papers, we do not have the ability in our model to advise all participants simultaneously. Finally, regarding Braverman et al. (2018): in their model, the agent is "learning" an appropriate response to the principal's pricing strategy. If the agents use naive mean-based learners, Braverman et al. (2018) provide a mechanism that extracts the full surplus. In particular, the agent fails to anticipate the mechanism that the principal is using. As they point out, this leads to odd behavior: the agent may purchase goods at a price exceeding her valuation. In our setting, the agent does not face uncertainty with respect to the mechanism; instead, she faces uncertainty with respect to the state sequence.
Counterfactual Internal Regret.
Without constraints on the agent's behavior, an early mistake by the principal can result in a permanent, undesirable shift in the agent's behavior. As we will see, this can occur when the agent behaves as if she has additional information about the state of the world that is not accounted for in our description of the model. The agent can make the principal's problem infeasible if she is willing to exploit her information selectively, i.e. based on the principal's choice of policies. Unfortunately, neither no-ER nor no-IR assumptions can rule out selective use of information.

Our notion of rationality requires the agent to fully and consistently exploit her information, regardless of the principal's chosen policies. Existing benchmarks like external and internal regret cannot capture this requirement. To see why, it helps to consider the fable of the tortoise and the hare. Both animals have an hour to traverse a one-mile track. For the tortoise, this requirement is feasible and binding: finishing in time means hustling, without substantial breaks or detours. For the hare, however, the requirement is hardly restrictive: it may stop for a break, walk rather than run, or even run around in circles while still finishing the race in time. Benchmarks like external or internal regret imply reasonable behavior for an uninformed agent (i.e. the tortoise). But for an informed agent (i.e. the hare), these benchmarks are easy enough to satisfy that it may engage in all kinds of frivolous behavior – possibly to the detriment of the principal.

The solution to our analogy is to strengthen the hare's benchmark. If the hare has to traverse the track in three minutes, it needs to hustle, like the tortoise. Similarly, if the agent has to obtain no-regret with her information as additional context, this would preclude the kind of frivolous behavior that makes the principal's problem infeasible. Of course, setting this benchmark requires us to know the nature and quality of the agent's information, just as we needed to know the top speed of the hare. The idea behind counterfactual internal regret is that we can identify the agent's information with her past behavior under counterfactual mechanisms. Intuitively, any information that is useful should eventually reveal itself through variation in behavior.
Main Results.
This paper considers three variations on our model: one where the principal knows the agent's information, one where the agent has no private information, and one where the agent may have private information. In each case, we propose a mechanism and bound the principal's regret in terms of the agent's counterfactual internal regret (CIR).

Our first mechanism is intended as a warmup. It requires oracle access to the agent's information and has poor performance in finite samples, but avoids some complications associated with information asymmetry between the principal and agent. First, the mechanism produces a calibrated forecast of the state in the current period using off-the-shelf algorithms, with the oracle's output as context. Second, it chooses the worst-case optimal policy in an 𝜖-robust version of the common prior game. In that game, the agent's response only needs to be 𝜖-approximately optimal, and the mechanism substitutes its forecast for the prior. (This approach seems spiritually similar to that of Immorlica, Mao, et al. (2020), who develop mechanisms that incentivize efficient social learning. By restricting attention to simple disclosures, i.e. unbiased subhistories, they significantly simplify the agents' inferential problem and can motivate a permissive notion of frequentist rationality. Having restricted disclosure in this manner, they nonetheless design mechanisms with optimal rates of convergence.)

Theorem 1 bounds the principal's regret under this mechanism, under some restrictions on the stage game. Suppose there are 𝑛 states, 𝑛 policies, and 𝑛 responses. Fix a parameter 𝜖 > 0 (controlling robustness) and 𝛿 > 0 (controlling the fineness of a grid). Our bound is
\[
\underbrace{O(\epsilon)}_{\text{cost of }\epsilon\text{-robustness}} \;+\; \frac{1}{\epsilon}\Bigg(\underbrace{O(\mathrm{CIR})}_{\text{agent's regret}} \;+\; \underbrace{\tilde{O}\!\big(\delta^{-n}\, n\, n^{n}\big/\sqrt{T}\,\big)}_{\text{forecast miscalibration}} \;+\; \underbrace{O(\delta)}_{\text{approximation error}}\Bigg) \tag{1}
\]
If the agent satisfies no-CIR, i.e. CIR → 0 as 𝑇 → ∞, then the principal's regret vanishes in 𝑇 as long as 𝜖, 𝛿 → 0 at the appropriate rates. Moreover, the principal's average payoffs converge to a natural benchmark: what he would have obtained in a stationary equilibrium of the repeated game with a common prior (the empirical distribution conditioned on the agent's information).

Our second mechanism applies when the agent is as uninformed as the principal. This mechanism is identical to the first, except its forecast does not use information revealed by the learner. We formalize "uninformedness" by assuming that the agent's external regret is non-negative (in conjunction with no-CIR). Theorem 2 bounds the principal's regret under this mechanism, under some additional restrictions on the stage game. Our bound is
\[
\underbrace{O(\epsilon)}_{\text{cost of }\epsilon\text{-robustness}} \;+\; \frac{1}{\epsilon}\Bigg(\underbrace{O(\mathrm{CIR})}_{\text{agent's regret}} \;+\; \underbrace{\tilde{O}\!\big(\delta^{-n}\, n\big/\sqrt{T}\,\big)}_{\text{forecast miscalibration}} \;+\; \underbrace{O(\delta)}_{\text{approximation error}}\Bigg) \tag{2}
\]
Compared to (1), this drops the exponential dependence on the number 𝑛 of policies. This is because the principal's forecast does not need to take into account the agent's information, which significantly reduces the forecast miscalibration in finite samples.

Our third mechanism applies even when the agent is more informed than the principal. Here, we consider an "informationally robust" version of the stage game, due to Bergemann and Morris (2013), where the agent receives a private signal from an unknown information structure. Like before, we formulate an 𝜖-robust version of this game, where the agent's response need only be 𝜖-approximately optimal.
Our mechanism is identical to the second mechanism, except that it chooses the worst-case optimal policy in the 𝜖-informationally-robust game instead of the 𝜖-robust game. Theorem 3 bounds the principal's regret under this mechanism, under some restrictions on the stage game. Let 𝜋̂_𝑇 denote the empirical distribution of states 𝑦^𝑇. Given a common prior 𝜋, let ∇(𝜋) be the difference between the principal's maxmax payoff and his maxmin payoff across all possible information structures. Roughly, our bound is
\[
\underbrace{\nabla(\hat{\pi}_T)}_{\text{cost of informational robustness}} + \underbrace{O(\epsilon)}_{\text{cost of }\epsilon\text{-robustness}} + \frac{1}{\epsilon}\Bigg(\underbrace{O(\mathrm{CIR})}_{\text{agent's regret}} + \underbrace{\tilde{O}\!\big(\delta^{-n}\, n\big/\sqrt{T}\,\big)}_{\text{forecast miscalibration}} + \underbrace{O(\delta)}_{\text{approximation error}}\Bigg) \tag{3}
\]
Unlike (1) and (2), the principal's regret does not vanish as 𝑇 → ∞. However, it is vanishing up to the cost of informational robustness ∇(𝜋̂_𝑇) that would also be present under a common prior, if the agent were more informed than the principal.

Finally, although our focus is not on computational complexity, the reader should note that the computational tractability of our mechanisms will depend critically on our ability to solve robust mechanism design problems under a common prior. So, while our bounds on the principal's regret apply to a large class of games, evaluating tractability may require a case-by-case analysis.

Additional Related Work.
Within computer science, many researchers share our goal of replacing prior knowledge in mechanism design with data. The literature on sample complexity in mechanism design allows the principal to learn the state distribution from i.i.d. samples (Balcan, Blum, Hartline, et al. 2008; Cole and Roughgarden 2014; Morgenstern and Roughgarden 2015; Syrgkanis 2017). Here, the data arrives as a batch rather than online, there is no repeated interaction, and the question of responsiveness does not arise. However, there has also been work that applies online learning to auction design (e.g. Daskalakis and Syrgkanis 2016; Dudík et al. 2017) and Stackelberg security games (e.g. Balcan, Blum, Haghtalab, et al. 2015). Here, agents are either short-lived or myopic, whereas our agent is long-lived and potentially forward-looking. These papers can avoid the agent's learning problem because they emphasize applications where the agent does not face uncertainty, or where truthfulness is a dominant strategy. In contrast, Cummings et al. (2020) and Immorlica, Mao, et al. (2020) study problems that are closer to our own, insofar as both the principal and the agent must learn from data. They impose behavioral assumptions that are suited for i.i.d. data, whereas our model generalizes to adversarial data.

Within economics, research has focused on relaxing prior knowledge, rather than replacing it entirely. Part of the literature on robust mechanism design relaxes the common prior to some kind of approximate agreement on the distribution (Artemov et al. 2013; Jehiel et al. 2012; Meyer-ter-Vehn and Morris 2011; Ollár and Penta 2017; Oury and Tercieux 2012). Our approach will suggest 𝜖-robustness and 𝜖-informational-robustness as alternatives to "approximate agreement".

Organization.
Section 2 introduces the stage game and 𝜖-robustness. Section 3 introduces the repeated game. Section 4 defines external, internal, and counterfactual internal regret. Section 5 presents our mechanism and regret bounds when the agent's learner is known. As preparation for the remaining results, section 6 introduces the stage game with private signals. Section 7 presents our mechanism and regret bounds when the agent is uninformed. Section 8 presents our mechanism and regret bounds when the agent may be more informed than the principal. Section 9 concludes with a discussion of open problems.

Appendix A applies these results to two special cases: our running example, and a principal-agent problem. Appendix B considers the complexity of the agent's learning problem. Appendix C describes our forecasting algorithms in more detail. Appendix D relaxes some of the restrictions on the stage game and generalizes our results. Appendix E collects proofs.

Our model features three participants: a male principal, a female agent, and nature. As advertised, we are interested in a repeated interaction between these participants. To begin with, however, we describe the stage game, which will constitute a single round of the repeated game. In the stage game, the principal moves first and commits to a policy 𝑝 ∈ 𝒫. Next, the agent observes the policy 𝑝 and then chooses a response 𝑟 ∈ ℛ. Utility functions depend on the response 𝑟, the policy 𝑝, and an unknown state of the world 𝑦 ∈ 𝒴, chosen by nature. Formally, the agent's utility function is 𝑈 : ℛ × 𝒫 × 𝒴 → [0, 1], while the principal's utility function is 𝑉 : ℛ × 𝒫 × 𝒴 → [0, 1].

Assumption 1 (Regularity). We impose the following regularity conditions.
1. The state space 𝒴 is finite.
2. The response space ℛ is a compact space with metric 𝑑_ℛ.
3. The policy space 𝒫 is a compact space with metric 𝑑_𝒫.
4. The utility 𝑈 is equi-Lipschitz continuous in (𝑟, 𝑝) for Lipschitz constants 𝐾^ℛ_𝑈 and 𝐾^𝒫_𝑈, i.e.
\[
\forall\, y \in \mathcal{Y}: \quad |U(r, p, y) - U(\tilde r, \tilde p, y)| \;\le\; K^{\mathcal{R}}_U\, d_{\mathcal{R}}(r, \tilde r) + K^{\mathcal{P}}_U\, d_{\mathcal{P}}(p, \tilde p)
\]
5. The utility 𝑉 is equi-Lipschitz continuous in (𝑟, 𝑝) for Lipschitz constants 𝐾^ℛ_𝑉 and 𝐾^𝒫_𝑉, i.e.
\[
\forall\, y \in \mathcal{Y}: \quad |V(r, p, y) - V(\tilde r, \tilde p, y)| \;\le\; K^{\mathcal{R}}_V\, d_{\mathcal{R}}(r, \tilde r) + K^{\mathcal{P}}_V\, d_{\mathcal{P}}(p, \tilde p)
\]

Later on, we will use covers to convert infinite action spaces into discrete approximations. For example, our running example involved an infinite policy space.
Definition 1 (Covers). Let 𝒳 be a metric space with metric 𝑑_𝒳. Generally, lower-case letters 𝑥 denote elements of 𝒳 while upper-case letters 𝑋 denote subsets.
1. Fix 𝛿_𝒳 > 0. Let the partition Σ_𝒳 be a 𝛿_𝒳-cover of 𝒳. That is, for every set 𝑋 ∈ Σ_𝒳, any two elements 𝑥, 𝑥̃ ∈ 𝑋 must be within distance 𝛿_𝒳 of one another, i.e. 𝑑_𝒳(𝑥, 𝑥̃) < 𝛿_𝒳.
2. To reduce notation, we also let Σ_𝒳 denote a discretized subset of 𝒳. That is, for each set 𝑋 ∈ Σ_𝒳, choose a unique 𝑥 ∈ 𝑋 to represent 𝑋. In that case, we say 𝑥 ∈ Σ_𝒳.
3. Let 𝑥 ∈ 𝒳 and 𝑥̃ ∈ Σ_𝒳. We say that 𝑥̃ is the discretization of 𝑥 if 𝑥, 𝑥̃ belong to the same subset 𝑋 ∈ Σ_𝒳.

We will refer to covers Σ_𝒫 of the policy space 𝒫 (with metric 𝑑_𝒫), Σ_ℛ of the response space ℛ (with metric 𝑑_ℛ), and Σ_{Δ(𝒴)} of the state distributions Δ(𝒴) (with the 𝑙₁ metric). Since 𝒫, ℛ, and Δ(𝒴) are all compact, we can always construct a finite cover. Of course, if the underlying set is finite to begin with, we can simply set 𝛿 = 0 and let Σ_𝒳 = 𝒳.

The stage game plays an important role in our analysis. Two of our results (theorems 1 and 2) are best understood as reducing the online mechanism design problem to the simpler task of finding a "locally robust" policy in the stage game. In the locally robust problem, we maintain the traditional common prior assumption: that is, the state 𝑦 is drawn from a commonly known distribution 𝜋. However, we relax the assumption that the agent maximizes her expected utility E_{𝑦∼𝜋}[𝑈(𝑟, 𝑝, 𝑦)]. Instead, she chooses a response (or a distribution 𝜇 over responses) that guarantees her an expected utility that is within an additive constant 𝜖 of the optimum. Since this assumption only partially identifies the agent's behavior, the principal's utility can take on a range of values. The principal's worst-case utility from following policy 𝑝 is described by the function
\[
\alpha_p(\pi, \epsilon) \;=\; \min_{\mu \in \Delta(\mathcal{R})} \; \mathbb{E}_{y \sim \pi}\big[\,\mathbb{E}_{r \sim \mu}[V(r, p, y)]\,\big]
\quad \text{s.t.} \quad
\max_{\tilde r \in \mathcal{R}} \mathbb{E}_{y \sim \pi}[U(\tilde r, p, y)] \;-\; \mathbb{E}_{y \sim \pi}\big[\,\mathbb{E}_{r \sim \mu}[U(r, p, y)]\,\big] \;\le\; \epsilon
\]
and his best-case utility is described by
\[
\beta_p(\pi, \epsilon) \;=\; \max_{\mu \in \Delta(\mathcal{R})} \; \mathbb{E}_{y \sim \pi}\big[\,\mathbb{E}_{r \sim \mu}[V(r, p, y)]\,\big]
\quad \text{s.t.} \quad
\max_{\tilde r \in \mathcal{R}} \mathbb{E}_{y \sim \pi}[U(\tilde r, p, y)] \;-\; \mathbb{E}_{y \sim \pi}\big[\,\mathbb{E}_{r \sim \mu}[U(r, p, y)]\,\big] \;\le\; \epsilon
\]
The worst-case optimal (or 𝜖-robust) policy, defined below, is one of two main ingredients in our proposed mechanisms (the other is a calibrated forecasting algorithm).

Definition 2 (𝜖-Robustness). The 𝜖-robust policy is worst-case optimal over all response distributions 𝜇 that achieve at least the agent's optimal expected utility minus 𝜖. Formally, the policy is
\[
p^{*}(\pi, \epsilon) \;\in\; \arg\max_{p \in \mathcal{P}} \; \alpha_p(\pi, \epsilon)
\]

Definition 3 (Cost of 𝜖-Robustness). Fix a distribution 𝜋 and parameter 𝜖 > 0. The cost of 𝜖-robustness is the distance between the principal's best-case utility (under the best-case optimal policy) and worst-case utility (under the worst-case optimal policy). Formally,
\[
\Delta(\pi, \epsilon) \;=\; \max_{p \in \mathcal{P}} \beta_p(\pi, \epsilon) \;-\; \alpha_{p^{*}(\pi, \epsilon)}(\pi, \epsilon)
\]
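The programs defining 𝛼_𝑝 and 𝛽_𝑝 are linear in the response distribution 𝜇, so when states, policies, and responses are finite they can be solved directly as linear programs. The following sketch is our own illustration (not the paper's implementation); it assumes utilities are supplied as arrays indexed by response, policy, and state, and it computes 𝛼_𝑝(𝜋, 𝜖), 𝛽_𝑝(𝜋, 𝜖), and the 𝜖-robust policy of definition 2.

```python
import numpy as np
from scipy.optimize import linprog

def alpha_beta(V, U, pi, p, eps):
    """Worst- and best-case principal utility, alpha_p(pi, eps) and
    beta_p(pi, eps), for a finite stage game.

    V[r, p, y] and U[r, p, y] are utility arrays, pi is a distribution over
    states, p is a policy index, eps >= 0.  Both programs are linear in the
    response distribution mu: optimize E[V] over mu subject to mu being an
    eps-approximate best response (always feasible, since the exact best
    response is feasible)."""
    v = V[:, p, :] @ pi                 # expected principal utility per response
    u = U[:, p, :] @ pi                 # expected agent utility per response
    opt = u.max()                       # the agent's exactly-optimal utility
    n_r = len(u)
    A_ub, b_ub = [-u], [eps - opt]      # constraint: opt - mu @ u <= eps
    A_eq, b_eq = [np.ones(n_r)], [1.0]  # mu is a probability distribution
    lo = linprog(v,  A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    hi = linprog(-v, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return lo.fun, -hi.fun

def robust_policy(V, U, pi, eps):
    """The eps-robust policy: maximize the worst-case utility alpha_p(pi, eps)."""
    alphas = [alpha_beta(V, U, pi, p, eps)[0] for p in range(V.shape[1])]
    return int(np.argmax(alphas))
```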
The cost of 𝜖-robustness will be a key variable in our upper bounds on the principal's regret in the repeated game. It will be convenient to assume that this cost is growing at most linearly in 𝜖, although this assumption is not really necessary (see appendix D).

Assumption 2. For any distribution 𝜋, Δ(𝜋, 𝜖) = 𝑂(𝜖).

Our bounds will also require the robust policy to tolerate some misspecification of 𝜖: suppose that, rather than coming within 𝜖 of her optimal expected utility, the agent only achieves her optimal expected utility minus 𝜖 + 𝜖̃, for 𝜖̃ > 0. Nonetheless, if the principal uses the 𝜖-robust policy, his utility degrades smoothly in the residual 𝜖̃.

Lemma 1.
Assume regularity (assumption 1). For any distribution 𝜋, policy 𝑝, and constants 𝜖, 𝜖̃ > 0, the principal's worst-case and best-case utilities satisfy
\[
\alpha_p(\pi, \epsilon + \tilde\epsilon) \;\ge\; \alpha_p(\pi, \epsilon) - \tilde\epsilon/\epsilon
\qquad \text{and} \qquad
\beta_p(\pi, \epsilon + \tilde\epsilon) \;\le\; \beta_p(\pi, \epsilon) + \tilde\epsilon/\epsilon
\]

Appendix A describes two well-known special cases of our model: Bayesian persuasion and the principal-agent problem. For each case, we provide a simple example, check that the example satisfies all relevant assumptions, and evaluate our results. In that sense, these examples serve as sanity checks for the rest of the paper, which involves assumptions and solutions that are sometimes rather abstract.
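Returning to lemma 1, the 1∕𝜖 dependence comes from a standard mixing argument; the following sketch is our own reconstruction of that step (it uses only the normalization 𝑉 ∈ [0, 1]), not the paper's proof, which appears in appendix E.

```latex
% Mixing argument behind Lemma 1 (our reconstruction; uses V in [0,1]).
Let $\mu$ be any $(\epsilon+\tilde\epsilon)$-approximate best response to policy
$p$ under prior $\pi$, and let $\mu^{\mathrm{opt}}$ be an exact best response.
With $\lambda = \tilde\epsilon/(\epsilon+\tilde\epsilon)$, the mixture
\[
  \mu' \;=\; (1-\lambda)\,\mu + \lambda\,\mu^{\mathrm{opt}}
\]
has suboptimality at most $(1-\lambda)(\epsilon+\tilde\epsilon) = \epsilon$, so
$\mu'$ is feasible in the program defining $\alpha_p(\pi,\epsilon)$.  Because
$V \in [0,1]$, the principal's expected utility changes by at most
$\lambda \le \tilde\epsilon/\epsilon$ when passing from $\mu$ to $\mu'$, hence
\[
  \alpha_p(\pi,\epsilon+\tilde\epsilon) \;\ge\; \alpha_p(\pi,\epsilon) - \tilde\epsilon/\epsilon .
\]
The bound for $\beta_p$ is symmetric.
```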
In the repeated game, the stage game is repeated 𝑇 times. In period 𝑡, the principal chooses policy 𝑝_𝑡, the agent chooses response 𝑟_𝑡, and nature chooses the state 𝑦_𝑡. At the end of period 𝑡, the state 𝑦_𝑡 is revealed to both the principal and the agent.

The agent's repeated game strategy (henceforth, learner 𝐿) maps the state history 𝑦^{𝑡−1}, the response history 𝑟^{𝑡−1}, the policy history 𝑝^{𝑡−1}, and the current policy 𝑝_𝑡 to a distribution 𝜇_𝑡 over responses. Formally, the response distribution in the 𝑡th period is given by
\[
L_t : \mathcal{Y}^{t-1} \times \mathcal{R}^{t-1} \times \mathcal{P}^{t} \to \Delta(\mathcal{R})
\]
(The fact that the response distribution 𝜇_𝑡 may depend on the realized response history 𝑟^{𝑡−1} allows the learner to introduce correlation between responses across time, if desired.) The principal's repeated game strategy (henceforth, mechanism 𝜎) maps the state history 𝑦^{𝑡−1}, the response history 𝑟^{𝑡−1}, and the policy history 𝑝^{𝑡−1} to a distribution 𝜈_𝑡 over policies. Formally, the policy distribution in the 𝑡th period is given by
\[
\sigma_t : \mathcal{Y}^{t-1} \times \mathcal{R}^{t-1} \times \mathcal{P}^{t-1} \to \Delta(\mathcal{P})
\]

Our goal is to design a mechanism 𝜎* that the principal would not regret using, relative to a finite set of alternative mechanisms. Regret – which we define momentarily – measures the gap in performance between 𝜎* and the alternative mechanism 𝜎 that performed best in hindsight, given the realized sequence of states 𝑦^𝑇. We consider a simple set of alternative mechanisms, corresponding to some finite set of fixed policies 𝒫̄ ⊆ 𝒫 that the principal wishes to consider. (Many of our results and definitions can be adapted to any finite set of nonresponsive mechanisms.) By a fixed policy 𝑝, we mean a constant mechanism 𝜎^𝑝 that selects the same policy 𝜎^𝑝_𝑡(𝑦^{𝑡−1}, 𝑟^{𝑡−1}, 𝑝^{𝑡−1}) = 𝑝 in all periods 𝑡 and for all histories.

To define the principal's regret, we need notation for the agent's behavior under the proposed mechanism 𝜎*, as well as under the counterfactual mechanisms 𝜎^𝑝. Fix the state sequence 𝑦^𝑇. Let 𝜇*_𝑡 describe the agent's behavior under 𝜎*, i.e. 𝜇*_𝑡 = 𝐿_𝑡(𝑦^{𝑡−1}, 𝑟*_{1:𝑡−1}, 𝑝*_{1:𝑡}), given the realized history of responses 𝑟*_{1:𝑡−1} and policies 𝑝*_{1:𝑡} under 𝜎*. Let 𝜇^𝑝_𝑡 describe the agent's behavior under 𝜎^𝑝, i.e. 𝜇^𝑝_𝑡 = 𝐿_𝑡(𝑦^{𝑡−1}, 𝑟^𝑝_{1:𝑡−1}, (𝑝, …, 𝑝)), where the policy history consists of 𝑝 repeated 𝑡 times, given the realized history of responses 𝑟^𝑝_{1:𝑡−1} under 𝜎^𝑝.
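For concreteness, the following sketch (ours; the callback signatures are hypothetical) records the interaction protocol just described. Replaying the same learner against the constant mechanism 𝜎^𝑝 is exactly how the counterfactual histories 𝑟^𝑝_{1:𝑡−1} behind 𝜇^𝑝_𝑡 are generated.

```python
import numpy as np

def sample(dist, rng):
    """Draw a key of `dist`, a dict mapping actions to probabilities summing to one."""
    actions, probs = zip(*dist.items())
    return actions[rng.choice(len(actions), p=list(probs))]

def play_repeated_game(mechanism, learner, states, rng):
    """Simulate the repeated interaction of this section (a sketch).

    mechanism(y_hist, r_hist, p_hist) returns a distribution over policies,
    learner(y_hist, r_hist, p_hist, p_t) returns a distribution over responses,
    and `states` is the (arbitrary) state sequence y_1, ..., y_T."""
    y_hist, r_hist, p_hist, transcript = [], [], [], []
    for y_t in states:
        nu_t = mechanism(y_hist, r_hist, p_hist)     # principal commits to p_t
        p_t = sample(nu_t, rng)
        mu_t = learner(y_hist, r_hist, p_hist, p_t)  # agent responds to p_t
        r_t = sample(mu_t, rng)
        transcript.append((r_t, p_t, y_t))
        y_hist.append(y_t); r_hist.append(r_t); p_hist.append(p_t)  # y_t revealed
    return transcript

# A fixed policy p corresponds to the constant mechanism below.
def constant_mechanism(p):
    return lambda y_hist, r_hist, p_hist: {p: 1.0}
```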
Definition 4 (Principal's Regret). The principal's regret relative to the best-in-hindsight fixed policy 𝑝 ∈ 𝒫̄ is
\[
\mathrm{PR}(L, y^{T}) \;=\; \sup_{p \in \bar{\mathcal{P}}} \; \frac{1}{T} \sum_{t=1}^{T} \Big( \mathbb{E}_{r \sim \mu^{p}_{t}}\big[ V(r, p, y_t) \big] \;-\; \mathbb{E}_{r \sim \mu^{*}_{t}}\big[ V\big(r, \sigma^{*}_{t}(y^{t-1}, r^{*}_{1:t-1}, p^{*}_{1:t-1}), y_t\big) \big] \Big)
\]
The mechanism 𝜎* satisfies no-regret if the principal's regret is 𝑜(1), i.e. it vanishes as 𝑇 → ∞. Recall that the no-regret mechanism design problem is infeasible without further assumptions on the learner 𝐿. The following proposition formalizes this simple observation.

Proposition 1 (Impossibility Result for Unrestricted Learners). In our running example, for every mechanism 𝜎*, there exists a learner 𝐿 along with a state sequence 𝑦 such that the principal's regret does not vanish, i.e.
\[
\lim_{T \to \infty} \mathrm{PR}(L, y^{T}) \;>\; 0
\]
In this section, we develop a restriction on the learner 𝐿 that captures "rational" behavior by the agent, without requiring assumptions on the state sequence 𝑦^𝑇. In particular, we build on no-regret assumptions pioneered in the literature on learning in games.

In online learning, regret measures how much better or worse off the agent would have been had she followed the best-in-hindsight "simple" strategy instead of her learner. Different notions of regret correspond to different definitions of simplicity. All of the regret notions used in this paper will be special cases of contextual regret, defined as follows. Given a sequence 𝑧^𝑇 of variables in some arbitrary set 𝒵, contextual regret considers a strategy "simple" if, for any two periods 𝑡 and 𝜏, sharing the same context 𝑧_𝑡 = 𝑧_𝜏 implies taking the same response 𝑟_𝑡 = 𝑟_𝜏.

Definition 5.
Given a sequence 𝑧^𝑇 of covariates, the agent's contextual regret relative to a best-in-hindsight modification rule ℎ : 𝒵 → ℛ is
\[
\mathrm{CR}(p^{T}, y^{T}) \;=\; \max_{h} \; \frac{1}{T} \sum_{t=1}^{T} \Big( U(h(z_t), p_t, y_t) - U(r_t, p_t, y_t) \Big)
\]
Note that, unlike our definition of the principal's regret, the agent's contextual regret does not take into account how changes in her past behavior would have also affected the principal's behavior. This omission is justified when the mechanism is nonresponsive.
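For finite response sets, contextual regret can be computed directly from a play transcript: the best modification rule simply picks, within each context, the single response that performs best in hindsight. The sketch below is ours (the paper provides no code); the transcript format and function name are assumptions of the sketch.

```python
from collections import defaultdict

def contextual_regret(U, transcript, contexts, response_set=None):
    """Contextual regret of definition 5, for a finite set of candidate responses.

    U(r, p, y) is the agent's utility, `transcript` is a list of triples
    (r_t, p_t, y_t), `contexts` is the sequence z_t, and `response_set` is the
    set of responses the modification rule h may use (defaults to the
    responses that appear in the transcript)."""
    if response_set is None:
        response_set = {r for r, _, _ in transcript}
    realized = defaultdict(float)       # realized utility accumulated per context
    by_context = defaultdict(list)      # (p_t, y_t) pairs per context
    for (r_t, p_t, y_t), z_t in zip(transcript, contexts):
        realized[z_t] += U(r_t, p_t, y_t)
        by_context[z_t].append((p_t, y_t))
    # Within each context, the best modification rule plays one fixed response.
    best = {
        z: max(sum(U(r, p, y) for p, y in periods) for r in response_set)
        for z, periods in by_context.items()
    }
    T = len(transcript)
    return sum(best[z] - realized[z] for z in by_context) / T
```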
Definition 6 (Responsiveness). A mechanism 𝜎 is nonresponsive if
\[
\sigma_t(y^{t-1}, r^{t-1}, p^{t-1}) \;=\; \sigma_t(y^{t-1}, \tilde r^{\,t-1}, p^{t-1})
\]
for any period 𝑡, state history 𝑦^{𝑡−1}, policy history 𝑝^{𝑡−1}, and response histories 𝑟^{𝑡−1}, 𝑟̃^{𝑡−1}.

Our mechanisms will be nonresponsive. This is a design choice, not an assumption. In restricting attention to nonresponsive mechanisms, we simplify the agent's problem and make our behavioral assumptions more credible. If our mechanisms were responsive, non-myopic agents would not necessarily satisfy no-regret as defined above. For example, an agent might decide to forgo an otherwise-optimal response if she believes said response would trigger an undesirable policy by the principal going forward. (For instance, in models of repeated sales, a buyer may refuse to purchase a good at a reasonable price if she believes that holding out will cause the seller to reduce prices in the future (Devanur et al. 2019; Immorlica, Lucier, et al. 2017).) This behavior would be perfectly reasonable but could cause the agent to accumulate regret. Finally, as it turns out, even nonresponsive mechanisms can guarantee vanishing principal's regret in two of the scenarios we study (sections 5 and 7). In these scenarios, there is limited room for responsive mechanisms to improve our guarantees.

In the remainder of this section, we define three special cases of contextual regret: external regret (ER), internal regret (IR), and counterfactual internal regret (CIR).
In our model, external regret is contextual regret where the policy 𝑝_𝑡 is the context in period 𝑡. That is, no-ER requires the agent to perform as well as the best-in-hindsight mapping from policies 𝑝_𝑡 to responses 𝑟_𝑡. Now, why should external regret include the policy as context? Because our stage game is an extensive form. A strategy in the stage game is not a response; it is a function from the observed policy to a response. Our definition of external regret compares the agent's performance to the best-in-hindsight strategy in the stage game.

Suppose that, instead, we compared the agent's performance to the best-in-hindsight response 𝑟 ∈ ℛ. Defining external regret in this way would confound variation in policies with variation in the state, and could lead to odd behavior. For example, consider the "mean-based" learner in Braverman et al. (2018), which never deviates far from the response that maximizes the agent's empirical utility. In that paper, the learner engages in odd behavior, like spending more than the agent's valuation. Our definition is more similar to that of Hartline, Johnsen, et al. (2019), where agents following a dashboard provided by the mechanism best respond to an allocation rule given the empirical value distribution, rather than best respond to the empirical bid distribution. This way, the agent adapts sensibly to changes in the principal's policy.
An immediate difficulty with defining ER is that the set 𝒫 may be continuous. For instance, this is true in our running example. (If 𝒫 is continuous, the policy 𝑝_𝑡 may be unique in every period 𝑡 = 1, …, ∞. In that case, requiring no-ER would be equivalent to requiring ex post optimality. That is unreasonably strong.) To ensure that the agent's learning problem is feasible in that case, we allow the agent to group together nearby policies according to the cover Σ_𝒫 (defined in section 2), and consider regret with respect to this coarser context. Of course, when the policy space is finite, there is no need for this, and we can set Σ_𝒫 = 𝒫.

Definition 7 (External Regret). The agent's external regret (ER) relative to the best-in-hindsight modification rule ℎ : Σ_𝒫 → ℛ is
\[
\mathrm{ER}(p^{T}, y^{T}) \;=\; \max_{h} \; \frac{1}{T} \sum_{t=1}^{T} \Big( U(h(p_t), p_t, y_t) - U(r_t, p_t, y_t) \Big)
\]
Note the slight abuse of notation. By ℎ(𝑝_𝑡), we mean ℎ(𝑃_𝑡) where 𝑃_𝑡 is the unique set in the partition Σ_𝒫 that contains 𝑝_𝑡.

Although common in the literature (e.g. Nekipelov et al. 2015, Braverman et al. 2018), no-ER assumptions are insufficient for our problem. They do not circumvent the impossibility result (proposition 1) that motivated us to restrict the agent's behavior in the first place. In particular, this is because they fail to rule out certain pathological behaviors. Because these pathological behaviors are clearly not in the agent's best interest, we also conclude that no-ER fails to rule out "irrational" behavior and is therefore not a good definition of "rationality". The following proposition (and its proof) clarifies the issue.
Proposition 2 (Impossibility Result for No-ER Learners). In our running example, for every mechanism 𝜎*, there exists a learner 𝐿 that guarantees no-ER on all state/policy sequences, i.e.
\[
\lim_{T \to \infty} \; \sup_{\tilde p^{\,T},\, \tilde y^{\,T}} \; \mathbb{E}_{L}\big[ \mathrm{ER}(\tilde p^{\,T}, \tilde y^{\,T}) \big] \;=\; 0
\]
along with a state sequence 𝑦 such that the principal's regret does not vanish, i.e.
\[
\lim_{T \to \infty} \mathrm{PR}(L, y^{T}) \;>\; 0
\]
Before defining CIR, we provide a brief intuition: what went wrong with external regret? Recall the tortoise and hare analogy in the introduction. For a behavioral assumption to rule out pathological behaviors, it may have to adapt to the information of the agent (or the speed of the animal).

There is no a priori sense in which the deterministic sequence 𝑦^𝑇 is predictable or not. In particular, the agent may behave as if she possesses "private information" about the sequence of states that goes beyond the "public information" inherent in the description of the model. In practice, the agent may have access to data that the principal lacks, notice a pattern that did not occur to the principal, or succeed through dumb luck. Formally, this reflects an adversary who simultaneously chooses the state sequence 𝑦^𝑇 and the learner 𝐿 to cause the mechanism 𝜎* to underperform. In particular, even though the agent may not observe 𝑦_𝑡 when choosing a response 𝑟_𝑡, this cannot prevent the adversary from "correlating" 𝑟_𝑡 and 𝑦_𝑡.

No-CIR requires the agent to consistently and fully exploit her private information. In the spirit of revealed preference, private information is identified with her behavior across counterfactual mechanisms. Intuitively, if the agent is able to distinguish between periods 𝑡, 𝜏 and finds it useful to do so, then her behavior should also differ between those two periods. If her behavior under one mechanism reveals private information, this information should also be accessible to her under a different mechanism. This logic allows us to define a purely ex post notion of rationality that does not refer to the agent's beliefs or to a distribution over state sequences.

No-CIR refines no-IR, a weaker condition that was developed in the literature on calibration (e.g. Foster and Vohra 1997). Internal regret is contextual regret where the context is the agent's own behavior 𝑟^𝑇. To ensure that the agent's learning problem is feasible when the response space ℛ is infinite, we allow the agent to group together nearby responses according to the cover Σ_ℛ, and consider regret with respect to this coarser context. Of course, when the response space is finite, as in our running example, there is no need for this, and we can set Σ_ℛ = ℛ.

Definition 8 (Internal Regret). The agent's internal regret (IR) relative to the best-in-hindsight modification rule ℎ : Σ_𝒫 × Σ_ℛ → ℛ is
\[
\mathrm{IR}(p^{T}, y^{T}) \;=\; \max_{h} \; \frac{1}{T} \sum_{t=1}^{T} \Big( U(h(p_t, r_t), p_t, y_t) - U(r_t, p_t, y_t) \Big)
\]
Like earlier, note the slight abuse of notation. By ℎ(𝑝_𝑡, 𝑟_𝑡), we mean ℎ(𝑃_𝑡, 𝑅_𝑡) where (𝑃_𝑡, 𝑅_𝑡) is the unique set in the collection Σ_𝒫 × Σ_ℛ that contains (𝑝_𝑡, 𝑟_𝑡).
Counterfactual internal regret is contextual regret where the context is the concatenation of: the policy 𝑝*_𝑡 under the proposed mechanism 𝜎*; the agent's behavior 𝑟*_{1:𝑇} under 𝜎*; and her counterfactual behavior 𝑟^𝑝_{1:𝑇} under the fixed policies 𝑝 ∈ 𝒫̄. The following definitions formalize this.

(To be clear, the "correlation" mentioned above is non-causal. For example, the adversary might choose a state sequence such that 𝑦_𝑡 = 1 on even periods and 𝑦_𝑡 = 0 on odd periods, and a learner 𝐿 such that 𝑟_𝑡 = 1 on even periods and 𝑟_𝑡 = 0 on odd periods. Empirically speaking, there would be a correlation between the states and the responses. However, if we subsequently changed the value of state 𝑦_𝑡 in some period 𝑡, this would not affect the response 𝑟_𝑡, because the state is not observed and cannot affect the output of the learner 𝐿. That is, there is no causal relationship between 𝑟_𝑡 and 𝑦_𝑡.)

Definition 9 (Information). Let the information partition be
\[
\mathcal{I} \;=\; \underbrace{\Sigma_{\mathcal{P}}}_{\text{policy } p^{*}_{t}} \times \underbrace{\Sigma_{\mathcal{R}}}_{\text{response } r^{*}_{t}} \times \underbrace{(\Sigma_{\mathcal{R}})^{|\bar{\mathcal{P}}|}}_{\text{responses } r^{p}_{t} \text{ for } p \in \bar{\mathcal{P}}}
\]
and let the information 𝐼_𝑡 in period 𝑡 be the unique set in 𝓘 that satisfies 𝐼_𝑡 ∋ (𝑝*_𝑡, 𝑟*_𝑡, (𝑟^𝑝_𝑡)_{𝑝 ∈ 𝒫̄}).

Note that, by definition, the same information 𝐼_𝑡 is available to the agent regardless of whether the principal follows our mechanism 𝜎* or deviates to some fixed policy 𝑝 ∈ 𝒫̄. Intuitively, the principal's choice of mechanism should not affect what information the agent has available.

Definition 10 (Counterfactual Internal Regret). The agent's counterfactual internal regret (CIR) relative to the best-in-hindsight modification rule ℎ : Σ_𝒫 × (Σ_ℛ)^{|𝒫̄|+1} → ℛ is
\[
\mathrm{CIR}(p^{T}, y^{T}) \;=\; \max_{h} \; \frac{1}{T} \sum_{t=1}^{T} \Big( U(h(I_t), p_t, y_t) - U(r_t, p_t, y_t) \Big)
\]
The discussion in the proof of proposition 2 clarifies how no-CIR rules out the kinds of pathological or irrational behavior that no-ER fails to rule out. In the next section, we will see the crucial role that no-CIR plays in proving our bounds on the principal's regret. The essential property is that, conditional on information 𝐼_𝑡, the agent chooses a roughly constant response that is approximately best-in-hindsight for whichever mechanism the principal is considering.
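The regret notions of definitions 7, 8, and 10 differ only in the context sequence fed into contextual regret. The sketch below (ours) builds those context sequences and can be combined with the contextual-regret sketch above. Note that constructing the CIR contexts requires the counterfactual responses 𝑟^𝑝_𝑡, i.e. replays of the learner under each fixed policy 𝑝 ∈ 𝒫̄, exactly as in definition 9.

```python
def er_contexts(policies):
    """External regret: the (discretized) policy is the context (definition 7)."""
    return list(policies)

def ir_contexts(policies, responses):
    """Internal regret: policy and own response form the context (definition 8)."""
    return list(zip(policies, responses))

def cir_contexts(policies, responses, counterfactual_responses):
    """Counterfactual internal regret (definition 10): the context I_t also
    includes the responses the learner would have chosen under each fixed
    policy p in the comparison set (definition 9).

    `policies` and `responses` are the realized (discretized) sequences under
    the proposed mechanism; counterfactual_responses[p][t] is r^p_t."""
    fixed_policies = sorted(counterfactual_responses)
    return [
        (p_t, r_t) + tuple(counterfactual_responses[p][t] for p in fixed_policies)
        for t, (p_t, r_t) in enumerate(zip(policies, responses))
    ]
```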
Our first result should be viewed as pedagogical. It bounds the principal's regret under a mechanism that requires oracle access to the agent's learner. This requirement is unrealistic and will be removed in sections 7 and 8. Likewise, the bound itself will feature an exponential dependence on the size of the policy space. This dependence will also be removed in later sections.

Definition 11 (Information Oracle). The information oracle Ω_𝑡 : 𝒫 → 𝓘 specifies the information 𝐼_𝑡 that the learner 𝐿 would generate in period 𝑡 given any policy 𝑝_𝑡 ∈ 𝒫 and the realized history.

This case is a convenient starting point because it avoids the bulk of the information asymmetries between the principal and the agent that our later results need to address. That follows from the fact that any private information generated by the learner can be anticipated by the principal with access to the information oracle. This case is also a convenient point of departure from the common prior assumption because it permits a wider range of agent behavior without relaxing the principal's knowledge of said behavior. To be clear, under a common prior, the fact that the principal knows the agent's prior means that he also has precise knowledge of the agent's learner. In addition, since the agent is Bayesian, she does not find it beneficial to randomize and her learner will typically be deterministic. Essentially, the common prior provides an information oracle for free.
Mechanism 1.
Let the distribution 𝜋_𝑡 be a forecast of the state 𝑦_𝑡, generated by a calibrated forecasting algorithm that uses the agent's information as context.
• Our forecasting algorithm applies a generic no-internal-regret algorithm due to Blum and Mansour (2007) to an auxiliary learning problem where the action space consists of discretized forecasts 𝜋 ∈ Σ_{Δ(𝒴)} and the loss function is the negated quadratic scoring rule 𝑆. In each period, the algorithm makes a prediction 𝜋_𝑡 and incurs loss −𝑆(𝜋_𝑡, 𝑦_𝑡). Further details as well as rates of convergence are in appendix C.
• The context is the vector of outputs Ω(𝑝) of the information oracle under discretized policies 𝑝 ∈ Σ_𝒫. The forecasting algorithm is run separately for each context.
Fix a parameter 𝜖̄ > 0. In period 𝑡, the informed-principal mechanism 𝜎* chooses the discretization of the 𝜖̄-robust policy 𝑝*(𝜋_𝑡, 𝜖̄) that treats the forecast 𝜋_𝑡 as a common prior.
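As a concrete summary, here is a minimal sketch (ours) of one period of mechanism 1. The helpers are stand-ins: `oracle` plays the role of the information oracle of definition 11, `forecaster_for` returns a calibrated forecaster per context (the paper uses the Blum and Mansour (2007) reduction described in appendix C; any calibrated forecaster could be substituted here), and `robust_policy` computes 𝑝*(·, 𝜖̄) as in the earlier linear-programming sketch.

```python
def mechanism_one_period(oracle, forecaster_for, robust_policy, policy_grid, state_hist):
    """One period of mechanism 1 (a sketch; all helper names are ours).

    oracle(p) returns the information the learner would generate this period
    under policy p (definition 11).  The tuple of oracle outputs over the
    discretized policy grid is the forecasting context.  forecaster_for(ctx)
    returns a calibrated forecaster for that context, and robust_policy(pi)
    returns (the discretization of) the eps_bar-robust policy for prior pi."""
    context = tuple(oracle(p) for p in policy_grid)
    pi_t = forecaster_for(context).predict(state_hist)   # calibrated forecast of y_t
    return robust_policy(pi_t)                            # play p*(pi_t, eps_bar)
```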
Before stating the theorem in full, we present the reasoning behind the result and clarify the components of the regret bound, as well as the assumptions required. First, we require some additional notation. Let "𝑡 ∈ 𝐼" indicate that information 𝐼 is present in period 𝑡, i.e. 𝐼_𝑡 = 𝐼. Let 𝑛_𝐼 = Σ_{𝑡=1}^{𝑇} 𝟙(𝑡 ∈ 𝐼) be the number of periods with information 𝐼. Let 𝜋̂_𝐼 be the empirical distribution conditioned on the agent having information 𝐼, i.e.
\[
\hat\pi_I(y) \;=\; \frac{1}{n_I} \sum_{t \in I} \mathbf{1}(y_t = y)
\]
We begin with a straightforward but important observation: across all periods 𝑡 ∈ 𝐼, the agent's response 𝑟*_𝑡 is roughly constant, as are her counterfactual responses 𝑟^𝑝_𝑡 under fixed policies 𝑝 ∈ 𝒫̄. By regularity (assumption 1), slight variations in responses have correspondingly slight impacts on the agent's and principal's utility. Suppose that these responses are exactly constant, i.e. 𝑟_𝑡 = 𝑟_𝐼. Note that 𝑝_𝑡 = 𝑝_𝐼 is exactly constant as well, across these time periods, for all constant mechanisms 𝜎^𝑝 as well as the proposed mechanism 𝜎*, which uses discretized policies. With everything constant, the principal's average utility across context 𝐼 takes on a familiar form:
\[
\frac{1}{n_I} \sum_{t \in I} V(r_I, p_I, y_t) \;=\; \mathbb{E}_{y \sim \hat\pi_I}\big[ V(r_I, p_I, y) \big]
\]
Similarly, the agent's average utility is
\[
\frac{1}{n_I} \sum_{t \in I} U(r_I, p_I, y_t) \;=\; \mathbb{E}_{y \sim \hat\pi_I}\big[ U(r_I, p_I, y) \big]
\]
Essentially, within each context 𝐼, we have recreated the stage game with common prior 𝜋̂_𝐼. The agent accumulates regret
\[
\epsilon_I \;=\; \max_{\tilde r}\, \mathbb{E}_{y \sim \hat\pi_I}\big[ U(\tilde r, p_I, y) \big] - \mathbb{E}_{y \sim \hat\pi_I}\big[ U(r_I, p_I, y) \big]
\]
Under mechanism 1, the principal chooses (roughly) the 𝜖̄-robust policy for the forecast 𝜋_𝑡. Suppose for the moment that the forecasts are also roughly constant for all periods 𝑡 ∈ 𝐼, i.e. 𝜋_𝑡 = 𝜋_𝐼. Since the forecast is calibrated and uses information 𝐼_𝑡 as context, 𝜋_𝐼 cannot be too far in the 𝑙₁ distance from 𝜋̂_𝐼 (this is essentially the definition of calibration, and follows from results in appendix C). It follows from regularity that the 𝜖̄-robust policy for 𝜋_𝐼 is nearly 𝜖̄-robust for 𝜋̂_𝐼.

At this point, the principal has (roughly) applied the 𝜖̄-robust policy for the empirical distribution 𝜋̂_𝐼 to an agent that obtains regret 𝜖_𝐼. In that sense, the principal has misjudged the agent's capacity to make mistakes. However, recall lemma 1: this affects the principal's best-case and worst-case utilities by at most 𝜖_𝐼∕𝜖̄. It follows that, roughly speaking, the principal's utility is not much worse than the worst-case optimal utility. At the same time, it cannot be much better than the best-case optimal utility. More precisely,
\[
\max_{\tilde p} \beta_{\tilde p}(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_I}{\bar\epsilon} \;\ge\; \mathbb{E}_{y \sim \hat\pi_I}\big[ V(r_I, p_I, y) \big] \;\ge\; \max_{\tilde p} \alpha_{\tilde p}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_I}{\bar\epsilon} \tag{4}
\]
By assumption 2, the difference between the upper bound and the lower bound is
\[
O(\bar\epsilon) + O\!\left(\frac{\epsilon_I}{\bar\epsilon}\right) \tag{5}
\]
This pins down the principal's utility under mechanism 1. Moreover, the upper bound in (4) also applies to any constant mechanism 𝜎^𝑝 for 𝑝 ∈ 𝒫̄. Therefore, (5) also bounds the regret accumulated by the principal in context 𝐼.

This brings us to our key assumption: the agent's CIR is at most some constant 𝜖.

Assumption 3 (Bounded CIR). Let 𝑦^𝑇 be the realized state sequence and let 𝑝*_{1:𝑇} be the policy sequence generated by the proposed mechanism 𝜎*. There exists a constant 𝜖 ≥ 0 such that
\[
\epsilon \;\ge\; \mathrm{CIR}(y^{T}, p^{*}_{1:T}) \qquad \text{and} \qquad \epsilon \;\ge\; \mathrm{CIR}(y^{T},\, \underbrace{p, \ldots, p}_{T \text{ times}}\,), \;\; \forall\, p \in \bar{\mathcal{P}}
\]

Remark 1.
It is worth emphasizing that this bound applies only to the realized state sequence 𝑦^𝑇. That is, the agent does not need to perform well over all state sequences, and her objective need not be worst-case regret minimization. If the agent is Bayesian, for example, she will obtain low CIR as long as her beliefs are well-calibrated.

Since CIR is contextual regret with information 𝐼_𝑡 as context, bounded CIR ensures that
\[
\epsilon \;\ge\; \frac{1}{T} \sum_{I \in \mathcal{I}} n_I\, \epsilon_I
\]
where 𝜖_𝐼 is the regret accumulated in the context of information 𝐼, and it follows that the principal's regret is bounded above by
\[
O(\bar\epsilon) + O\!\left(\frac{\epsilon}{\bar\epsilon}\right)
\]
To transform this intuition into a result, we need to address an assumption made along the way: that the forecast 𝜋_𝑡 is roughly constant across all periods 𝑡 ∈ 𝐼. This is not necessarily true. The adversary can choose a sequence of states 𝑦^𝑇 that makes the principal appear more informed than the agent. Indeed, variation in forecasts can be interpreted as private information of the principal, even if it is spurious. On the other hand, any variation in 𝜋_𝑡 that affects the policy 𝑝_𝑡 will also be included in the agent's information 𝐼_𝑡. What remains is variation in 𝜋_𝑡 that does not affect the policy – useless information from the principal's perspective, but not necessarily useless to the agent. If the principal expects the agent to exploit this information and the agent does not, this can lead to a suboptimal policy choice.

The following assumption restricts attention to stage games where this problem does not arise; that is, the agent's failure to exploit information that is useless to the principal does not affect the principal's utility. In appendix D, we avoid this restriction by instead assuming that the principal – using our publicly announced mechanism – is not more informed than the agent.

Assumption 4.
Let 𝜖 > 0. Let 𝜋 and 𝜋̃ be distributions in the stage game. If the 𝜖-robust policies under 𝜋 and under 𝜋̃ are close to one another, then they are also close to the 𝜖-robust policy under any convex combination of these distributions. Formally, for any 𝜆 ∈ [0, 1],
\[
d_{\mathcal{P}}\big( p^{*}(\pi, \epsilon),\; p^{*}(\lambda\pi + (1-\lambda)\tilde\pi,\, \epsilon) \big) \;=\; O\Big( d_{\mathcal{P}}\big( p^{*}(\pi, \epsilon),\, p^{*}(\tilde\pi, \epsilon) \big) \Big)
\]

The following theorem formalizes the preceding discussion and bounds the principal's regret under mechanism 1.
Theorem 1.
Assume regularity (assumption 1), restrictions on the stage game (assumptions 2 and 4), and 𝜖-bounded CIR (assumption 3). Let 𝜎* be mechanism 1. Given access to the information oracle, for any constant 𝜖̄ > 0, the principal's expected regret E_{𝜎*}[PR(𝐿, 𝑦^𝑇)] is at most
\[
\underbrace{O(\bar\epsilon)}_{\text{cost of }\bar\epsilon\text{-robustness}}
\;+\; \frac{1}{\bar\epsilon}\cdot\Bigg(
\underbrace{O(\epsilon)}_{\text{agent's regret}}
\;+\; \underbrace{\tilde{O}\!\Big(T^{-1/4}\sqrt{\,\big|\Sigma_{\Delta(\mathcal{Y})}\big|\,\big|\Sigma_{\mathcal{R}}\big|^{(|\bar{\mathcal{P}}|+|\Sigma_{\mathcal{P}}|)/2}\,}\Big)}_{\text{forecast miscalibration}}
\;+\; O(\delta_{\Delta(\mathcal{Y})})\Bigg)
\;+\; \underbrace{O(\delta_{\mathcal{R}}) + O(\delta_{\mathcal{P}})}_{\text{discretization error}}
\]
Here are a few comments on this result.
1. The bound depends on the size of the partitions Σ_𝒫, Σ_ℛ, and Σ_{Δ(𝒴)}. However, if we define these partitions to be as small as possible, we can replace these terms with the covering numbers of 𝒫, ℛ, and Δ(𝒴), respectively. In that sense, our finite sample bounds will deteriorate as one increases the complexity of the action and state spaces.
2. Furthermore, if we define these partitions to be the smallest possible, then theorem 1 implies that the principal's regret vanishes if 𝑇 → ∞ and 𝜖, 𝜖̄, 𝛿_{Δ(𝒴)}, 𝛿_ℛ, 𝛿_𝒫 → 0 at the appropriate rates. It also follows from the proof that the principal's payoffs converge to a natural benchmark: what he would have obtained in a stationary equilibrium of the repeated game where it is common knowledge that 𝑦_𝑡 is drawn independently from the empirical distribution 𝜋̂_{𝐼_𝑡}. Formally,
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p_t, y_t) \;-\; \frac{1}{T}\sum_{I \in \mathcal{I}} n_I \max_{p \in \mathcal{P}} \beta_p(\hat\pi_I, 0) \;\to\; 0
\]
3. Finally, note the exponential dependence on the number of alternative mechanisms |𝒫̄| and the size of the policy space cover |Σ_𝒫|. This dependence, which is not present in theorems 2 and 3, reflects the fact that mechanism 1 uses the agent's information 𝐼_𝑡 as context for its forecast 𝜋_𝑡. Since our bound is uniform across all learners that satisfy 𝜖-bounded CIR on the realized state sequence 𝑦^𝑇, it must accommodate learners that generate a lot of information, regardless of whether that information is useful. As mentioned at the beginning of this section, this is another reason why the "informed principal" setting seems less compelling than the settings studied in sections 7 and 8.

In general, we cannot expect the principal to have access to an information oracle. Fortunately, we can still construct mechanisms 𝜎* that obtain vanishing or bounded principal's regret without any knowledge of the learner. However, in order to state the relevant assumptions (sections 7 and 8) and describe the mechanism (section 8), we need to consider scenarios where the agent has private information that the principal lacks. This requires a brief detour. In this section, we revisit the stage game in order to introduce terminology that reflects the agent's private information.

Suppose that the state 𝑦 is drawn from a known distribution 𝜋, but the agent has access to a private signal 𝐼 ∈ 𝓘 generated by the information structure 𝛾.

Definition 12 (Information Structure). An information structure is a function 𝛾 : 𝓘 × 𝒴 → [0, 1] where 𝛾(⋅, 𝑦) is a probability distribution over 𝓘.

The game proceeds as follows. First, nature chooses a hidden state 𝑦 ∼ 𝜋. Second, the principal chooses a policy 𝑝. Third, the agent observes a signal 𝐼 ∼ 𝛾(⋅, 𝑦) and chooses a response 𝑟_𝐼. For instance, if the agent maximizes her expected utility, her responses after signals 𝐼 would be
\[
r_I \;\in\; \arg\max_{\tilde r_I \in \mathcal{R}} \; \mathbb{E}_{y \sim \pi}\Big[ \mathbb{E}_{I \sim \gamma(\cdot, y)}\big[ U(\tilde r_I, p, y) \big] \Big]
\]
Finally, the state 𝑦 is revealed and payoffs are determined.

As in section 2, suppose the agent does not necessarily maximize her expected utility. Instead, she chooses responses 𝑟_𝐼 (or distributions 𝜇_𝐼 over responses) that guarantee her an expected utility that is within an additive constant 𝜖 of the optimum. For a given information structure 𝛾, the principal's worst-case utility from following policy 𝑝 is described by
\[
\alpha_p(\pi, \gamma, \epsilon) \;=\; \min_{\mu_I \in \Delta(\mathcal{R})} \; \mathbb{E}_{y \sim \pi}\Big[ \mathbb{E}_{I \sim \gamma(\cdot, y)}\big[ \mathbb{E}_{r \sim \mu_I}[V(r, p, y)] \big] \Big]
\]
subject to
\[
\max_{\tilde r_I \in \mathcal{R}} \mathbb{E}_{y \sim \pi}\Big[ \mathbb{E}_{I \sim \gamma(\cdot, y)}\big[ U(\tilde r_I, p, y) \big] \Big] \;-\; \mathbb{E}_{y \sim \pi}\Big[ \mathbb{E}_{I \sim \gamma(\cdot, y)}\big[ \mathbb{E}_{r \sim \mu_I}[U(r, p, y)] \big] \Big] \;\le\; \epsilon
\]
and his best-case utility is described by
\[
\beta_p(\pi, \gamma, \epsilon) \;=\; \max_{\mu_I \in \Delta(\mathcal{R})} \; \mathbb{E}_{y \sim \pi}\Big[ \mathbb{E}_{I \sim \gamma(\cdot, y)}\big[ \mathbb{E}_{r \sim \mu_I}[V(r, p, y)] \big] \Big]
\]
subject to the same constraint. Note that 𝛼_𝑝(𝜋, 𝜖), the worst-case utility in the stage game without a private signal, is equivalent to 𝛼_𝑝(𝜋, 𝛾, 𝜖) when 𝛾 is uninformative. The same applies to 𝛽.
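For a fixed, finite information structure 𝛾, the program defining 𝛼_𝑝(𝜋, 𝛾, 𝜖) remains a single linear program in the signal-contingent response distributions 𝜇_𝐼, because the 𝜖-approximate best-response constraint aggregates across signals. The sketch below is ours (the array conventions are assumptions). The 𝜖-informationally-robust policy defined next would additionally take an infimum over 𝛾, an outer and non-linear search that is not shown here.

```python
import numpy as np
from scipy.optimize import linprog

def alpha_with_signal(V, U, pi, gamma, p, eps):
    """Worst-case principal utility alpha_p(pi, gamma, eps) for a *fixed*
    finite information structure gamma (a sketch; names are ours).

    V[r, p, y] and U[r, p, y] are utilities, pi[y] is the prior, and
    gamma[i, y] is the probability of signal i in state y.  The decision
    variables are the signal-contingent response distributions mu_i, flattened
    in (signal, response) order."""
    n_i, n_y = gamma.shape
    n_r = V.shape[0]
    w = gamma * pi                                   # w[i, y] = P(signal i, state y)
    # Per-signal expected utilities of each pure response.
    v = np.einsum('ry,iy->ir', V[:, p, :], w)        # v[i, r] = sum_y w[i,y] V[r,p,y]
    u = np.einsum('ry,iy->ir', U[:, p, :], w)
    opt = u.max(axis=1).sum()                        # best signal-contingent responses
    c = v.reshape(-1)                                # minimize total expected V
    A_ub = [-u.reshape(-1)]                          # constraint: opt - mu.u <= eps
    b_ub = [eps - opt]
    A_eq = np.kron(np.eye(n_i), np.ones(n_r))        # each mu_i sums to one
    b_eq = np.ones(n_i)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return res.fun
```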
Recall that theorem 1 could be interpreted as reducing the online mechanism design problem to the simpler task of finding an 𝜖-robust policy in the stage game without a private signal. The same is true of our next result, theorem 2. In contrast, theorem 3 reduces the online problem to solving for a robust policy when the agent has a private signal generated by an unknown information structure. This corresponds to the notion of informational robustness introduced by Bergemann and Morris (2013) and applied by Bergemann, Brooks, et al. (2017), here adapted to our single-agent setting.

Definition 13 (𝜖-Informational-Robustness). The worst-case optimal (or 𝜖-informationally-robust) policy for an unknown information structure 𝛾 is
\[
p^{\dagger}(\pi, \epsilon) \;\in\; \arg\max_{p \in \mathcal{P}} \; \inf_{\gamma} \; \alpha_p(\pi, \gamma, \epsilon)
\]

Definition 14 (Cost of 𝜖-Informational-Robustness). Fix a distribution 𝜋 and parameter 𝜖 > 0. The cost of 𝜖-informational-robustness is the distance between the principal's best-case utility (under the best-case optimal policy, for the best-case information structure) and his worst-case utility (under the worst-case optimal policy, for the worst-case information structure). Formally,
\[
\nabla(\pi, \epsilon) \;=\; \max_{p \in \mathcal{P}} \sup_{\gamma} \beta_p(\pi, \gamma, \epsilon) \;-\; \max_{p \in \mathcal{P}} \inf_{\gamma} \alpha_p(\pi, \gamma, \epsilon)
\]

Let ∇(𝜋) = ∇(𝜋, 0) denote the cost of informational robustness in the traditional setting where the agent is optimizing exactly (𝜖 = 0). It will be convenient to assume that the cost is growing at most linearly in 𝜖, although this assumption is not really necessary (see appendix D).

Assumption 5.
For any distribution 𝜋, ∇(𝜋, 𝜖) = ∇(𝜋) + 𝑂(𝜖).

Why do we evaluate the cost of informational robustness under the worst-case information structure? Because the regret guarantee that we obtain in theorem 3 applies uniformly across all learners 𝐿. As we will see, different learners will induce different empirical information structures 𝛾. Our cost of informational robustness must accommodate the worst-case information structure, which loosely corresponds to the worst-case learner.

Lemma 2.
Assume regularity (assumption 1). For any distribution $\pi$, information structure $\gamma$, policy $p$, and constants $\epsilon, \tilde\epsilon > 0$, the principal's worst-case and best-case utilities satisfy
$$\alpha_p(\pi,\gamma,\epsilon+\tilde\epsilon) \ge \alpha_p(\pi,\gamma,\epsilon) - \frac{\tilde\epsilon}{\epsilon} \qquad\text{and}\qquad \beta_p(\pi,\gamma,\epsilon+\tilde\epsilon) \le \beta_p(\pi,\gamma,\epsilon) + \frac{\tilde\epsilon}{\epsilon}$$

Our second result bounds the principal's regret under a mechanism that does not require detailed knowledge of the learner $L$. Instead, this result assumes that the agent is not more informed than the principal. To begin, the mechanism is as follows.

Mechanism 2.
Let the distribution $\pi_t$ be a forecast of the state $y_t$.

• Our forecasting algorithm applies a generic no-internal-regret algorithm due to Blum and Mansour (2007) in an auxiliary learning problem where the action space consists of discretized forecasts $\pi \in \Delta(\mathcal{Y})$ and the loss function is the negated quadratic scoring rule.

Fix a parameter $\bar\epsilon > 0$. In period $t$, the uninformed-agent mechanism $\sigma^*$ chooses the discretization of the $\bar\epsilon$-robust policy $p^*(\pi_t, \bar\epsilon)$ that treats the forecast $\pi_t$ as a common prior.

What does it mean for an agent to be uninformed? Following the intuition developed in section 4, the agent's behavior cannot reveal an understanding of the state sequence that goes far beyond the principal's forecast. This can be formalized by adding a lower bound on the agent's ER to our upper bound on the agent's (counterfactual) IR.

Assumption 6 (Lower-Bounded ER). Let $y^T$ be the realized state sequence and let $p^*_{1:T}$ be the policy sequence generated by the proposed mechanism $\sigma^*$. There exists a constant $\tilde\epsilon \ge 0$ such that
$$\mathrm{ER}(y^T, p^*_{1:T}) \ge -\tilde\epsilon \qquad\text{and}\qquad \mathrm{ER}(y^T, \underbrace{p,\dots,p}_{T\text{ times}}) \ge -\tilde\epsilon, \quad \forall p \in \mathcal{P}$$

While there is no a priori sense in which the deterministic sequence $y^T$ is predictable or not, this combination of bounds can be seen as an ex post definition of unpredictability. Intuitively, if an agent fully exploits the information she reveals under the proposed mechanism $\sigma^*$ (no-IR) without outperforming the best use of public information (non-negative ER), her private information cannot be particularly useful. Fully exploiting useless information generally means ignoring it.

To see this, suppose the policy $p$ is fixed and that the learner obtains non-positive IR and non-negative ER. It is trivial to show that IR is non-negative and bounded below by ER, so it follows that both are roughly zero.

(Although they study a different problem, Blum, Gunasekar, et al. (2018) also use lower bounds on ER to prove results, exploiting the fact that exponential weights guarantees non-negative expected ER (Gofer and Mansour 2016).)

Recall the argument behind theorem 1: there, the principal's forecast was the empirical distribution over periods $t \in I$ with information $I$ as context. It followed immediately from the definition of information that the agent's responses $r_t$ were roughly some constant $r_I$. Furthermore, since the principal's forecasts used $I_t$ as context, the constant policy $p_I$ was calibrated to the empirical distribution $\hat\pi_I$.

Now, our mechanism does not have access to $I_t$ and is not calibrated to $\hat\pi_I$. Instead, for every policy context $P \in \Sigma$, it is calibrated to the empirical distribution $\hat\pi_P$ conditioned on $p_t \in P$. Formally,
$$\hat\pi_P(y) = \frac{1}{n_P}\sum_{t\in P} \mathbb{1}(y_t = y)$$
where $t \in P$ indicates $p_t \in P$ and $n_P$ is the number of periods $t \in P$. The policy context $P$ is coarser than the information $I$, by definition of the latter. So, the principal behaves as if the agent shares his prior $\hat\pi_P$, while the agent behaves as if she receives $I$ as a private signal.

This is where non-negative ER comes in. The agent's information $I$ is useless to her. If there is a unique best-in-hindsight response within policy context $P$, then the agent will choose roughly the same response $r_t = r_P$ in every period $t \in P$. In other words, the policy context $P$ coincides with the agent's information $I$, and the principal is correct in assuming that the agent (roughly) optimizes against the empirical distribution $\hat\pi_P$. Our previous argument goes through.

Again, we just assumed that there is a unique best-in-hindsight response within policy context $P$. What if this is not the case, i.e. the best-in-hindsight response is not unique? In general, the argument breaks down.
The agent can condition her action on her private information $I$, which no longer necessarily coincides with $P$. To be clear, this private signal $I$ remains useless to the agent. Moreover, the $\bar\epsilon$-robust policy is by definition robust to multiplicity of best responses. However, if the agent's best response is correlated with the state, this can undermine the principal's utility even if it does not affect the agent's.

(For example, consider a stage game with a binary response $r \in \{0,1\}$, a binary state $y \in \{0,1\}$, and a binary policy $p \in \{\text{Risky}, \text{Safe}\}$. The agent's utility is always zero. The principal's utility under the risky policy is $1$ if $r = y$ and $-1$ otherwise. It is slightly negative under the safe policy. If $y$ is drawn from the uniform distribution, and the agent optimizes without a signal, then the principal prefers the risky policy. If the agent receives a signal that is perfectly correlated with the state, and sets $r = 1 - y$, then the principal prefers the safe policy.)

The following assumption restricts attention to stage games where this issue does not arise. Informally, it asserts that if a private signal is useless to the agent, then it has limited relevance to the principal, assuming that the principal is following (nearly) optimal policies. Formally, the value of information structure $\gamma$ to the agent in the stage game with common prior $\pi$ and policy $p$ is
$$\phi_p(\pi,\gamma) = \max_{r,\, r_I \in \mathcal{R}} \; \mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{I\sim\gamma(\cdot,y)}\big[U(r_I,p,y)\big] - U(r,p,y)\Big]$$
in words, the expected utility of the agent when she responds to the signal generated by $\gamma$ minus the expected utility of the agent if she does not receive a private signal.
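As a quick illustration (our own sketch, not the paper's), $\phi_p$ can be tabulated directly from the primitives. The snippet below uses the standard value-of-information reading, in which the agent best responds both with and without the signal; the paper's outer maximization may be defined slightly differently, so treat this as one natural instantiation.

```python
import numpy as np

def value_of_information(U, pi, gamma):
    """phi_p(pi, gamma): the agent's gain from best-responding to the signal
    generated by gamma, relative to best-responding with no signal at all.

    U     : array of shape (R, Y) -- agent's utility for the fixed policy p
    pi    : array of shape (Y,)   -- prior over states
    gamma : array of shape (I, Y) -- gamma[i, y] = Pr(signal i | state y)
    """
    w = gamma * pi[None, :]                                   # joint weight on (signal, state)
    with_signal = sum(max(w_i @ u_r for u_r in U) for w_i in w)
    without_signal = max(pi @ u_r for u_r in U)
    return with_signal - without_signal

if __name__ == "__main__":
    U = np.array([[1.0, 0.0], [0.0, 1.0]])                    # agent wants to match the state
    pi = np.array([0.5, 0.5])
    print(value_of_information(U, pi, np.eye(2)))              # fully informative: gain of 0.5
    print(value_of_information(U, pi, np.full((2, 2), 0.5)))   # uninformative: gain of 0.0
```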
Assumption 7. Let $\pi$ be a distribution, $\epsilon > 0$ be a constant, and $\gamma$ be an information structure (intuitively, one that is not useful to the agent).

1. If the principal uses the $\epsilon$-robust policy $p^*(\pi,\epsilon)$, his maxmin payoff without $\gamma$, i.e. $\alpha_{p^*(\pi,\epsilon)}(\pi,\epsilon)$, is not much larger than his maxmin payoff with $\gamma$, i.e. $\alpha_{p^*(\pi,\epsilon)}(\pi,\gamma,\epsilon)$. That is,
$$\alpha_{p^*(\pi,\epsilon)}(\pi,\epsilon) - \alpha_{p^*(\pi,\epsilon)}(\pi,\gamma,\epsilon) = O\big(\phi_{p^*(\pi,\epsilon)}(\pi,\gamma)\big) + O(\epsilon)$$
2. The principal's maxmax payoff with $\gamma$ under any policy $p \in \mathcal{P}$, i.e. $\beta_p(\pi,\gamma,\epsilon)$, is not much larger than his maxmax payoff without $\gamma$ under the best-case policy, i.e. $\max_{\tilde p\in\mathcal{P}} \beta_{\tilde p}(\pi,\epsilon)$. That is,
$$\beta_p(\pi,\gamma,\epsilon) - \max_{\tilde p\in\mathcal{P}} \beta_{\tilde p}(\pi,\epsilon) = O\big(\phi_p(\pi,\gamma)\big) + O(\epsilon)$$

Both parts of assumption 7 would be immediate if the information structure $\gamma$ were uninformative, because the left-hand sides would be non-positive. Basically, we require private signals that are useless to the agent to be similar to uninformative private signals in these two respects.

Finally, we are ready to bound the principal's regret under mechanism 2.

Theorem 2.
Assume regularity (assumption 1), restrictions on the stage game (assumptions 2, 4, 7), $\epsilon$-bounded CIR (assumption 3), and $\tilde\epsilon$-lower-bounded ER (assumption 6). Let $\sigma^*$ be the uninformed-agent mechanism 2. For any constant $\bar\epsilon > 0$, the principal's expected regret $\mathbb{E}_{\sigma^*}\big[\mathrm{PR}(L, y^T)\big]$ is at most
$$\underbrace{O(\bar\epsilon)}_{\text{cost of }\bar\epsilon\text{-robustness}} + \underbrace{O(\tilde\epsilon)}_{\text{agent's information}} + \frac{1}{\bar\epsilon}\cdot\Big(\underbrace{O(\epsilon)}_{\text{agent's regret}} + \underbrace{\tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\Delta(\mathcal{Y})|}\big)}_{\text{forecast miscalibration}} + O(\delta)\Big) + \underbrace{O(\delta) + O(\delta)}_{\text{discretization error}}$$

Remark 3.
If we define the partition of $\Delta(\mathcal{Y})$ to be the smallest possible, then theorem 2 implies that the principal's regret vanishes if $T \to \infty$ and $\epsilon$, $\bar\epsilon$, $\tilde\epsilon$, and the discretization parameters $\delta$ tend to $0$ at the appropriate rates. It also follows from the proof that the principal's payoffs converge to a natural benchmark: what he would have obtained in a stationary equilibrium of the repeated game where it is common knowledge that $y_t$ is drawn independently from the empirical distribution $\hat\pi_{P_t}$. Formally,
$$\frac{1}{T}\sum_{t=1}^T V(r_t, p_t, y_t) - \frac{1}{T}\sum_{P} n_P \max_{p\in\mathcal{P}} \beta_p(\hat\pi_P, 0) \;\to\; 0$$

Mechanism for an Informed Agent
In section 4, we assumed that the principal knows the agent's learner $L$. The implication of this assumption is that the principal is as informed as the agent. In section 5, we assumed that the agent is as uninformed as the principal. In this section, we allow the agent to be more informed than the principal. This generality comes at a cost: we no longer ensure vanishing principal's regret. Instead, we show that, in the limit, the following mechanism guarantees regret that is no greater than the cost of informational robustness.

Mechanism 3.
Let the distribution $\pi_t$ be a forecast of the state $y_t$.

• Our forecasting algorithm applies a generic no-internal-regret algorithm due to Blum and Mansour (2007) in an auxiliary learning problem where the action space consists of the discretized forecasts $\pi \in \Delta(\mathcal{Y})$ and the loss function is the negated quadratic scoring rule.

Fix a parameter $\bar\epsilon > 0$. In period $t$, the informed-agent mechanism $\sigma^*$ chooses the discretization of the $\bar\epsilon$-informationally-robust policy $p^\dagger(\pi_t, \bar\epsilon)$ that treats the forecast $\pi_t$ as a common prior.

Theorem 3 builds on the same reasoning as theorems 1 and 2. First, we need to adapt assumption 4 to the case with private signals.
Assumption 8.
Let $\epsilon > 0$. Let $\pi$ and $\tilde\pi$ be distributions in the stage game. If the $\epsilon$-informationally-robust policies under $\pi$ and under $\tilde\pi$ are close to one another, then they are also close to the $\epsilon$-informationally-robust policy under any convex combination of these distributions. Formally, for any $\lambda \in [0,1]$,
$$d\big(p^\dagger(\pi,\epsilon),\, p^\dagger(\lambda\pi + (1-\lambda)\tilde\pi,\epsilon)\big) = O\Big(d\big(p^\dagger(\pi,\epsilon),\, p^\dagger(\tilde\pi,\epsilon)\big)\Big)$$

Next, recall how, in the previous section, we were concerned that the principal's policy $p_t$ in period $t$ was calibrated to the empirical distribution $\hat\pi_P$ given the policy context $P$ (where $t \in P$) rather than the empirical distribution $\hat\pi_I$ given the information $I = I_t$. There, we resolved that problem by assuming the agent was uninformed (non-negative ER). Here, our solution is even simpler: choose a policy $p_t$ that is robust to the agent's private information $I$, whatever that may be.

To be more precise, recall that the policy context $P$ is coarser than the information $I$. We can interpret periods $t \in I$ as those periods in which the agent received a private signal $I$. By looking at the frequency of information $I$ within policy context $P$, we can define an empirical information structure $\hat\gamma_P$ using Bayes' rule, i.e.
$$\hat\gamma_P(I, y) = \frac{n_I\, \hat\pi_I(y)}{n_P\, \hat\pi_P(y)}\cdot \mathbb{1}(I \subseteq P)$$
where $I \subseteq P$ is shorthand for $t \in I \implies t \in P$. Before, we could roughly approximate the principal's and agent's utility as their expected utility in the stage game where the state $y$ was drawn from the empirical distribution $\hat\pi_I$. Now, the approximation is the expected utility in the stage game where $y \sim \hat\pi_P$ and the agent receives a private signal $I$ from the empirical information structure $\hat\gamma_P$. Of course, the principal's policy $p_t$ is robust to all information structures $\gamma$, including $\hat\gamma_P$.

Next, we formalize this discussion and bound the principal's regret under mechanism 3.
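Before stating the theorem, here is a small sketch of this bookkeeping (our own illustration, with hypothetical variable names): the empirical distributions $\hat\pi_P$, $\hat\pi_I$ and the empirical information structure $\hat\gamma_P$ can be tabulated from the history of (policy context, information, state) triples.

```python
from collections import Counter, defaultdict

def empirical_information_structure(history):
    """history: list of (policy_context P, information I, state y) triples, one per
    period, where each information set I is assumed to refine exactly one context P.
    Returns (pi_hat[P][y], gamma_hat[P][(I, y)]) following Bayes' rule:
        gamma_hat_P(I, y) = n_I * pi_hat_I(y) / (n_P * pi_hat_P(y)).
    """
    n_P, n_I = Counter(), Counter()
    count_Py, count_Iy = Counter(), Counter()
    parent = {}                                    # which policy context each I sits in
    for P, I, y in history:
        n_P[P] += 1
        n_I[I] += 1
        count_Py[(P, y)] += 1
        count_Iy[(I, y)] += 1
        parent[I] = P

    pi_hat = defaultdict(dict)
    for (P, y), c in count_Py.items():
        pi_hat[P][y] = c / n_P[P]

    gamma_hat = defaultdict(dict)
    for (I, y), c in count_Iy.items():
        P = parent[I]
        pi_I_y = c / n_I[I]
        gamma_hat[P][(I, y)] = (n_I[I] * pi_I_y) / (n_P[P] * pi_hat[P][y])
    return pi_hat, gamma_hat

if __name__ == "__main__":
    # Two periods in context P0 with signal "hi", one with signal "lo".
    hist = [("P0", "hi", "rain"), ("P0", "hi", "rain"), ("P0", "lo", "sun")]
    pi_hat, gamma_hat = empirical_information_structure(hist)
    print(dict(pi_hat["P0"]))      # {'rain': 2/3, 'sun': 1/3}
    print(dict(gamma_hat["P0"]))   # Pr('hi' | 'rain') = 1.0, Pr('lo' | 'sun') = 1.0
```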
Theorem 3. Assume regularity (assumption 1), restrictions on the stage game (assumptions 5, 8), and $\epsilon$-bounded CIR (assumption 3). Let $\sigma^*$ be the informed-agent mechanism 3. For any constant $\bar\epsilon > 0$, the principal's expected regret $\mathbb{E}_{\sigma^*}\big[\mathrm{PR}(L, y^T)\big]$ is at most
$$\underbrace{\frac{1}{T}\sum_{P} n_P\, \nabla(\hat\pi_P) + O(\bar\epsilon)}_{\text{cost of }\bar\epsilon\text{-informational-robustness}} + \frac{1}{\bar\epsilon}\cdot\Big(\underbrace{O(\epsilon)}_{\text{agent's regret}} + \underbrace{\tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\Delta(\mathcal{Y})|}\big)}_{\text{forecast miscalibration}} + O(\delta)\Big) + \underbrace{O(\delta) + O(\delta)}_{\text{discretization error}}$$

Remark 4.
In contrast to our previous results, this regret bound does not vanish. However, if we define the partition of $\Delta(\mathcal{Y})$ to be the smallest possible, the bound does converge to
$$\frac{1}{T}\sum_{P} n_P\, \nabla(\hat\pi_P)$$
as $T \to \infty$ and $\epsilon$, $\bar\epsilon$, and the discretization parameters $\delta$ tend to $0$ at the appropriate rates. This is the best possible guarantee in a stationary equilibrium of the repeated game where (a) it is common knowledge that $y_t$ is drawn independently from the empirical distribution $\hat\pi_{P_t}$ and (b) the agent has access to an unknown information structure $\gamma$.

We studied single-agent mechanism design where the common prior assumption is replaced with repeated interaction and frequent feedback about the world. Our primary motivation was to remove a barrier (the common prior) that makes it difficult to implement mechanisms in practice. However, we also want to emphasize that this work can be viewed as a learning foundation for (robust) mechanism design. Indeed, our results show that policies similar to those predicted by a common prior can perform well even without making any assumptions about the data-generating process. This lends credibility to researchers who invoke the common prior for tractability, but do not expect it to be taken literally. However, there are two caveats.

1. Our policies are robust to agents that behave suboptimally by up to some $\epsilon > 0$. In contrast, most papers on local robustness involve an optimizing agent with misspecified beliefs (e.g. Artemov et al. 2013; Meyer-ter-Vehn and Morris 2011; Oury and Tercieux 2012). These notions coincide sometimes but not always. In addition, our policies sometimes require informational robustness (Bergemann and Morris 2013).

2. The number of interactions $T$ required for our mechanisms to approximate the static common prior game may be large. In particular, our bounds depend on features of the stage game, like the size of the policy and response spaces, and the number of states. These features may also affect the agent's learning rate, which in turn affects our bounds. In that sense, the common prior assumption may be less appealing in games that are more complex.

Further Work.
There are several directions in which to generalize and improve this work. To begin, it is not clear whether our finite sample bounds have a tight dependence on the number of periods $T$ and various other parameters. For example, is it possible to remove the exponential dependence in theorem 1 on the size of the policy space? In addition, there may be opportunities for tightening our results in less abstract settings where the stage game has more structure.

Our analysis was restricted to single-agent problems. Suppose there are multiple agents. From the perspective of any one agent, her opponents correspond to adaptive adversaries (cf. Arora, Dinitz, et al. 2018) whose future behavior is influenced by the agent's present response. However, if the number of participants is large and the mechanism's outcome preserves the differential privacy of each agent's response history (cf. McSherry and Talwar 2007), the behavioral assumptions developed here may also be suitable for the multi-agent setting.

We assumed that the principal and agent observe the state after every interaction, but this may be unrealistic in many applications. For instance, in contract theory the state is a function from the agent's actions to outcomes. Let us briefly refer to the principal-agent problem in appendix A.2. There, if the agent chooses to work, we do observe whether the project succeeds or not. However, we may not learn whether the project would have succeeded had the agent shirked. To mitigate this issue, we could consider the case with bandit feedback, where participants observe their own payoffs but not the state. The challenge with bandit feedback is that it requires responsive mechanisms, as the mechanism must depend on the principal's payoffs, which in turn depend on the agent's response.

In section 8, where the agent may be more informed than the principal, the principal's regret did not vanish but rather converged to the cost of informational robustness under a common prior. There is reason to believe that this result is not tight. Although the principal will never have access to the private signal $I$ of the agent, he may attempt to learn (via the agent's past behavior) about the information structure $\gamma$ that generates it. In turn, the agent may anticipate this and attempt to manipulate the principal's policy by feigning (partial) ignorance of her private signal. This suggests a less conservative definition of informational robustness, where the principal learns the quality of any information that the agent decides to exploit. However, in the repeated game, this approach would require responsive mechanisms.

(One approach the principal might take is to attempt to discern the agent's beliefs from the description of her learner $L$, and substitute those beliefs for his own forecast. If successful, this would tie the principal's forecast miscalibration to the agent's counterfactual internal regret.)

(Relatedly, Balcan, Blum, Haghtalab, et al. (2015) consider a repeated Stackelberg game where the state is the agent's private type. The principal receives bandit feedback: he never observes the type directly but can infer it from the agent's behavior. The issues associated with responsiveness do not arise in this model as the agent is myopic, or more precisely, there is a sequence of short-lived agents.)
As the last two paragraphs illustrate, we need a theory of behavior for responsive mechanisms. The no-regret conditions used here and elsewhere are not as well-motivated when the mechanism (or adversary) is responsive, insofar as they do not generalize traditional rationality assumptions. Extending the logic of no-regret conditions to a larger set of mechanisms – but not necessarily all mechanisms – is a clear priority for further work.
References
Anunrojwong, J., Iyer, K., & Manshadi, V. (2020). Information design for congested social services:optimal need-based persuasion. In
Proceedings of the 21st acm conference on economics andcomputation (pp. 349–350). EC ’20. Virtual Event, Hungary: Association for ComputingMachinery.Arora, R., Dekel, O., & Tewari, A. (2012). Online bandit learning against an adaptive adversary:from regret to policy regret. In
Proceedings of the 29th international coference on inter-national conference on machine learning (pp. 1747–1754). ICML’12. Edinburgh, Scotland:Omnipress.Arora, R., Dinitz, M., Marinov, T. V., & Mohri, M. (2018). Policy regret in repeated games. In
Proceedings of the 32nd international conference on neural information processing systems (pp. 6733–6742). NIPS’18. Montréal, Canada: Curran Associates Inc.Artemov, G., Kunimoto, T., & Serrano, R. (2013). Robust virtual implementation: Toward a rein-terpretation of the Wilson doctrine.
Journal of Economic Theory , (2), 424–447.Balcan, M.-F., Blum, A., Haghtalab, N., & Procaccia, A. D. (2015). Commitment without regrets:online learning in stackelberg security games. In Proceedings of the sixteenth acm conferenceon economics and computation (pp. 61–78). EC ’15. Portland, Oregon, USA: Association forComputing Machinery.Balcan, M.-F., Blum, A., Hartline, J. D., & Mansour, Y. (2008). Reducing mechanism design toalgorithm design via machine learning.
Journal of Computer and System Sciences , (8),1245–1270.Bergemann, D., Brooks, B., & Morris, S. (2017). First-price auctions with general informationstructures: implications for bidding and revenue. Econometrica , (1), 107–143.Bergemann, D. & Morris, S. (2013). Robust predictions in games with incomplete information. Econometrica , (4), 1251–1308.Blum, A., Gunasekar, S., Lykouris, T., & Srebro, N. (2018). On preserving non-discriminationwhen combining expert advice. In Proceedings of the 32nd international conference on neu-ral information processing systems (pp. 8386–8397). NIPS’18. Montréal, Canada: CurranAssociates Inc.Blum, A., Hajiaghayi, M., Ligett, K., & Roth, A. (2008). Regret minimization and the price oftotal anarchy. In
Proceedings of the fortieth annual acm symposium on theory of computing (pp. 373–382). STOC '08. Victoria, British Columbia, Canada: ACM.
Blum, A. & Mansour, Y. (2007). From external to internal regret.
J. Mach. Learn. Res. 8 , 1307–1324.Boutilier, C. (2012). Eliciting forecasts from self-interested experts: scoring rules for decision mak-ers. In
Proceedings of the 11th international conference on autonomous agents and multia-gent systems - volume 2 (pp. 737–744). AAMAS ’12. Valencia, Spain: International Founda-tion for Autonomous Agents and Multiagent Systems.Braverman, M., Mao, J., Schneider, J., & Weinberg, M. (2018). Selling to a no-regret buyer. In
Proceedings of the 2018 acm conference on economics and computation (pp. 523–538). EC’18. Ithaca, NY, USA: ACM.Buşoniu, L., Babuška, R., & De Schutter, B. (2010). Multi-agent reinforcement learning: an overview.In D. Srinivasan & L. C. Jain (Eds.),
Innovations in multi-agent systems and applications - 1 (pp. 183–221). Berlin, Heidelberg: Springer Berlin Heidelberg.Carroll, G. (2015). Robustness and linear contracts.
American Economic Review , (2), 536–63.Cesa-Bianchi, N. & Lugosi, G. (2006). Prediction, learning, and games . New York, NY, USA:Cambridge University Press.Cole, R. & Roughgarden, T. (2014). The sample complexity of revenue maximization. In
Proceed-ings of the forty-sixth annual acm symposium on theory of computing (pp. 243–252). STOC’14. New York, New York: ACM.Cummings, R., Devanur, N. R., Huang, Z., & Wang, X. (2020). Algorithmic price discrimination.In
Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms .SODA ’20. Salt Lake City, Utah, USA.Das, S., Kamenica, E., & Mirka, R. (2017). Reducing congestion through information design. In (pp. 1279–1284).Daskalakis, C. & Syrgkanis, V. (2016). Learning in auctions: regret is hard, envy is easy. In (pp. 219–228).Deng, Y., Schneider, J., & Sivan, B. (2019). Strategizing against no-regret learners. In H. Wallach,H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.),
Advances inneural information processing systems 32 (pp. 1579–1587). Curran Associates, Inc.Devanur, N. R., Peres, Y., & Sivan, B. (2019). Perfect bayesian equilibria in repeated sales.
Gamesand Economic Behavior , , 570–588.Dudík, M., Haghtalab, N., Luo, H., Schapire, R. E., Syrgkanis, V., & Vaughan, J. W. (2017). Oracle-efficient online learning and auction design. In (pp. 528–539).Dughmi, S. & Xu, H. (2016). Algorithmic Bayesian persuasion. In Proceedings of the forty-eighthannual acm symposium on theory of computing (pp. 412–425). STOC ’16. Cambridge, MA,USA: ACM.Dütting, P., Roughgarden, T., & Talgam-Cohen, I. (2019). Simple versus optimal contracts. In
Proceedings of the 2019 acm conference on economics and computation (pp. 369–387). EC '19. Phoenix, AZ, USA: ACM.
Ely, J. C. & Szydlowski, M. (2020). Moving the goalposts.
Journal of Political Economy , (2),468–506.Foster, D. P. & Vohra, R. V. (1997). Calibrated learning and correlated equilibrium. Games andEconomic Behavior , (1), 40–55.Gale, D. & Shapley, L. S. (1962). College admissions and the stability of marriage. The AmericanMathematical Monthly , (1), 9–15.Gofer, E. & Mansour, Y. (2016). Lower bounds on individual sequence regret. Machine Learning , (1), 1–26.Goldstein, I. & Leitner, Y. (2018). Stress tests and information disclosure. Journal of EconomicTheory , , 34–69.Hart, S. & Mas-Colell, A. (2001). A general class of adaptive strategies. Journal of EconomicTheory , (1), 26–54.Hartline, J. D., Johnsen, A., Nekipelov, D., & Zoeter, O. (2019). Dashboard mechanisms for onlinemarketplaces. In Proceedings of the 2019 acm conference on economics and computation (pp. 591–592). EC ’19. Phoenix, AZ, USA: ACM.Hartline, J. D., Syrgkanis, V., & Tardos, É. (2015). No-regret learning in bayesian games. In
Pro-ceedings of the 28th international conference on neural information processing systems -volume 2 (pp. 3061–3069). NIPS’15. Montreal, Canada: MIT Press.Hu, J. & Wellman, M. P. (1998). Multiagent reinforcement learning: theoretical framework andan algorithm. In
Proceedings of the fifteenth international conference on machine learning (pp. 242–250). ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.Immorlica, N., Lucier, B., Pountourakis, E., & Taggart, S. (2017). Repeated sales with multiplestrategic buyers. (pp. 167–168). EC ’17. Cambridge, Massachusetts, USA: Association forComputing Machinery.Immorlica, N., Mao, J., Slivkins, A., & Wu, Z. S. (2020). Incentivizing exploration with selectivedata disclosure. In
Proceedings of the 21st acm conference on economics and computation (pp. 647–648). EC ’20. Virtual Event, Hungary: Association for Computing Machinery.Jehiel, P., Meyer-ter-Vehn, M., & Moldovanu, B. (2012). Locally robust implementation and itslimits.
Journal of Economic Theory , (6), 2439–2452.Jose, V. R. R., Nau, R. F., & Winkler, R. L. (2008). Scoring rules, generalized entropy, and utilitymaximization. Operations Research , (5), 1146–1157.Kamenica, E. & Gentzkow, M. (2011). Bayesian persuasion. American Economic Review , (6),2590–2615.Kearns, M., Mansour, Y., & Ng, A. Y. (1999). Approximate planning in large pomdps via reusabletrajectories. In Proceedings of the 12th international conference on neural information pro-cessing systems (pp. 1001–1007). NIPS’99. Denver, CO: MIT Press.Littlestone, N. & Warmuth, M. (1994). The weighted majority algorithm.
Information and Computation, (2), 212–261.
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the eleventh international conference on machine learning (pp. 157–163). ICML'94. New Brunswick, NJ, USA: Morgan Kaufmann Publishers Inc.
Mansour, Y., Slivkins, A., Syrgkanis, V., & Wu, Z. S. (2016). Bayesian exploration: incentivizing exploration in bayesian games. In Proceedings of the 2016 acm conference on economics and computation (p. 661). EC '16. Maastricht, The Netherlands.
McCarthy, J. (1956). Measures of the value of information.
Proceedings of the National Academyof Sciences , (9), 654–655.McSherry, F. & Talwar, K. (2007). Mechanism design via differential privacy. In Proceedings ofthe 48th annual ieee symposium on foundations of computer science (pp. 94–103). FOCS’07. USA: IEEE Computer Society.Meyer-ter-Vehn, M. & Morris, S. (2011). The robustness of robust implementation.
Journal ofEconomic Theory , (5), 2093–2104.Mirrlees, J. A. (1971). An exploration in the theory of optimum income taxation. The Review ofEconomic Studies , (2), 175–208.Morgenstern, J. & Roughgarden, T. (2015). The pseudo-dimension of near-optimal auctions. In Proceedings of the 28th international conference on neural information processing systems- volume 1 (pp. 136–144). NIPS’15. Montreal, Canada: MIT Press.Myerson, R. B. (1981). Optimal auction design.
Mathematics of Operations Research , (1), 58–73.Nekipelov, D., Syrgkanis, V., & Tardos, E. (2015). Econometrics for learning agents. In Proceedingsof the sixteenth acm conference on economics and computation (pp. 1–18). EC ’15. Portland,Oregon, USA: ACM.Ollár, M. & Penta, A. (2017). Full implementation and belief restrictions.
American EconomicReview , (8), 2243–77.Oury, M. & Tercieux, O. (2012). Continuous implementation. Econometrica , (4), 1605–1637.Ross, S. A. (1973). The economic theory of agency: the principal’s problem. The American Eco-nomic Review , (2), 134–139.Roth, A. E. (1982). The economics of matching: stability and incentives. Mathematics of OperationsResearch , (4), 617–628.Ryabko, D. & Hutter, M. (2008). On the possibility of learning in reactive environments with arbi-trary dependence. Theoretical Computer Science , (3), 274–284.Sappington, D. (1983). Limited liability contracts between principal and agent. Journal of Eco-nomic Theory , (1), 1–21.Spence, M. & Zeckhauser, R. (1971). Insurance, information, and individual action. The AmericanEconomic Review , (2), 380–387.Syrgkanis, V. (2017). A sample complexity measure with applications to learning optimal auctions.In Proceedings of the 31st international conference on neural information processing systems (pp. 5358–5365). NIPS’17. Long Beach, California, USA: Curran Associates Inc.Uther, W. & Veloso, M. (2003).
Adversarial reinforcement learning .Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders.
The Journal of Finance, (1), 8–37.

A Special Cases
A.1 Bayesian Persuasion
There is an informed sender (i.e. principal) and an uninformed receiver (i.e. agent). The principal designs the process by which information is revealed to the agent. Let $\mathcal{M}$ be a finite set of messages that he can send. Knowing that the agent will react to an informative message, the principal attempts to persuade the agent towards actions that he prefers. Let $\mathcal{A}$ be a finite set of actions that the agent can take. The agent chooses a response $r : \mathcal{M} \to \mathcal{A}$ that maps messages to actions.

A policy is an information structure $p : \mathcal{Y} \to \Delta(\mathcal{M})$. That is, an information structure $p_t(y_t)$ describes the probability of a message $m_t$ being sent, conditional on the state being $y_t$. The agent receives the message $m_t$ and takes action $a_t = r_t(m_t)$. While the agent may not know the process that generated the state $y_t$, she understands the process $p_t$ that generates the message $m_t$ conditional on the state. Armed with this understanding, she can infer something about the state $y_t$ based on the message $m_t$.

All that remains is to specify payoffs. Let $u : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ be the agent's utility function from a given action in a given state. Similarly, let $v : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ be the principal's utility. In the previous subsection, the utility functions $U$, $V$ depended on the triple $(r, p, y)$ rather than the pair $(a, y)$. To reconcile the two models, we let participants evaluate $(r, p, y)$ by their expected utility conditional on the state. Formally,
$$U(r,p,y) = \sum_{m\in\mathcal{M}} p(m, y)\cdot u(r(m), y) \qquad\text{and}\qquad V(r,p,y) = \sum_{m\in\mathcal{M}} p(m, y)\cdot v(r(m), y)$$
When the state is fixed, the residual variation in utility is due to the fact that messages are drawn randomly from the distribution $p(y)$. These distributions are common knowledge because the agent observes the principal's policy $p$ before taking an action. Indeed, the fact that the principal commits to an information structure is the defining feature of Bayesian persuasion.

Example 1 (Judge-Prosecutor Game). The state space is $\mathcal{Y} = \{\text{Innocent}, \text{Guilty}\}$ and the action space is $\mathcal{A} = \{\text{Convict}, \text{Acquit}\}$. The judge has 0-1 utility $u$ and prefers to convict if the defendant is guilty and acquit if the defendant is innocent. Regardless of the state, the prosecutor's utility $v$ is 1 following a conviction and 0 following an acquittal.

This example satisfies regularity (assumption 1) with the discrete metric on $\mathcal{A}$, the $\ell$-metric on $\mathcal{P}$, and $K_U = K_V = 1$ (and likewise for the remaining regularity constants).

The worst-case policy $p^*(\pi,\epsilon)$ sends the message "convict" whenever the defendant is guilty. If the defendant is innocent, it sends the message "convict" with probability
$$q = \max\Big\{0,\; \min\Big\{1,\; \frac{p-\epsilon}{p}\Big\}\Big\}$$
The cost of $\epsilon$-robustness $\Delta(\pi,\epsilon) = O(\epsilon)$ decreases smoothly with $\epsilon$. This game satisfies assumption ?? since $q$ is increasing in $p$ (and hence convex combinations of distributions will yield values of $q$ between those of the $\epsilon$-robust policies for the extremal distributions, which are close by assumption).

The worst-case policy $p^\dagger(\pi,\epsilon)$ for an unknown private signal is full transparency. The cost of informational robustness, i.e. $\nabla(\pi, 0)$, is the difference between the principal's value under the common prior $\pi$ and his payoff under full transparency. This game satisfies assumption ?? with $M_1 = 1$ and $M_2 = O(\epsilon)$. It trivially satisfies assumptions ?? and ?? since $p^\dagger$ is constant.
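For intuition, here is a short self-contained sketch (ours, not the paper's) of an $\epsilon$-robust signalling policy in the judge–prosecutor stage game: always send "convict" when guilty, and send "convict" when innocent with the largest probability $q$ such that conviction remains the judge's best response by a margin of $\epsilon$; the exact constants in the paper's $p^*(\pi,\epsilon)$ may differ.

```python
def robust_persuasion_policy(prior_guilty, eps):
    """One way to build an eps-robust signalling policy in the judge-prosecutor game.

    Send "convict" whenever the defendant is guilty; when innocent, send "convict"
    with probability q chosen as large as possible subject to the judge preferring
    conviction by a margin of eps after seeing "convict", so that any eps-optimizing
    judge still convicts.  (Illustrative derivation, not the paper's exact formula.)
    """
    g = prior_guilty
    if g <= 0.0:
        return 0.0
    if g >= 1.0:
        return 1.0
    # Require P(guilty | "convict") - P(innocent | "convict") >= eps, i.e.
    #   g - (1 - g) * q >= eps * (g + (1 - g) * q).
    q = g * (1.0 - eps) / ((1.0 - g) * (1.0 + eps))
    return max(0.0, min(1.0, q))

def conviction_probability(prior_guilty, q):
    """Probability that the 'convict' message is sent (and, by construction, obeyed)."""
    return prior_guilty + (1.0 - prior_guilty) * q

if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.3):
        q = robust_persuasion_policy(prior_guilty=0.4, eps=eps)
        print(eps, round(q, 3), round(conviction_probability(0.4, q), 3))
```

With $\epsilon = 0$ this recovers the familiar Kamenica–Gentzkow solution (conviction probability $2\,\pi(\text{Guilty})$ when guilt is less likely than innocence), and the conviction probability falls smoothly as $\epsilon$ grows, mirroring the $O(\epsilon)$ cost of robustness above.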
A.2 Contract Design

In classic models of moral hazard, the principal incentivizes an agent to put effort into a task the principal cares about. The timing of the game is as follows: (1) the principal commits to a contract, (2) the agent takes a hidden action, (3) nature randomly chooses an outcome, (4) the agent is paid based on the outcome, (5) the game concludes. For concreteness, we consider the limited liability model due to Sappington (1983) where both participants are risk-neutral but the principal is not allowed to charge the agent. This model has been popularized by recent work in robust contract design (see e.g. Carroll 2015, Dütting et al. 2019).

Formally, let $\mathcal{A}$ be a finite set of actions that the agent can take. Let $\mathcal{O}$ be a finite set of outcomes $o$. The principal observes the outcome but not the action. The state $y : \mathcal{A} \to \mathcal{O}$ describes how actions map to outcomes. The employer commits to a contract $p : \mathcal{O} \to [0, \bar p]$ that specifies a non-negative payment for each outcome. The cost function $c : \mathcal{A} \to \mathbb{R}$ describes how costly it is for the agent to take a particular action. The agent's utility function is
$$U(r, p, y) = p(y(r)) - c(r)$$
The benefit function $b : \mathcal{O} \to \mathbb{R}$ describes how beneficial a given outcome is to the principal. The principal's utility function is
$$V(r, p, y) = b(y(r)) - p(y(r))$$
Through the contract $p$, the principal can incentivize the agent to take actions that, depending on the state, will lead to a more beneficial outcome.

Example 2.
The agent is given a task of unknown difficulty. There are two actions $\mathcal{A} = \{\text{work}, \text{shirk}\}$, two outcomes $\mathcal{O} = \{\text{success}, \text{failure}\}$, and three states $\mathcal{Y} = \{\text{trivial}, \text{moderate}, \text{impossible}\}$. In the trivial state, both actions lead to success. In the impossible state, both actions lead to failure. In the moderate state, work leads to success and shirk leads to failure.

The principal's benefits are $b(\text{success}) = 2$ and $b(\text{failure}) = 0$. The agent's costs are $c(\text{work}) = 1$ and $c(\text{shirk}) = 0$. In the impossible and trivial states, the optimal contract pays nothing after both outcomes and the agent will shirk. In the moderate state, the optimal contract pays $p(\text{success}) = 1$ to cover the agent's cost if she works, and otherwise $p(\text{failure}) = 0$. Generally, if the principal pays the agent after success, the agent will have to take into account the risk that the task turns out to be impossible (where work induces costs without any payment) or trivial (where work is not required for payment). To incentivize work, the contract must compensate the agent accordingly.

This example satisfies regularity (assumption 1) with $U$, $V$ normalized, the discrete metric on $\mathcal{A}$, the sup-norm metric on $\mathcal{P}$, and $K_U = K_V = 1$ (and likewise for the remaining regularity constants).

The worst-case policy $p^*(\pi,\epsilon)$ sets $p(\text{failure}) = 0$ and
$$p(\text{success}) = \frac{c(\text{work}) - c(\text{shirk}) + \epsilon}{\pi(\text{moderate})}$$
so long as $p(\text{success}) \le \bar p$ and the principal's $\pi$-expected payoff is greater than zero when the agent works. Otherwise, the worst-case policy sets all transfers to zero. The cost of $\epsilon$-robustness $\Delta(\pi,\epsilon) = O(\epsilon)$ decreases smoothly with $\epsilon$.

This game satisfies assumption ??. To see this, note that as long as working is strictly more costly than shirking, the optimal policies that induce effort are bounded away from the optimal policies that do not. Among the policies that do not induce effort, convex combinations of the distribution will not make inducing effort desirable. Among policies that do induce effort, the fact that the payments following success are decreasing in $\pi(\text{moderate})$ means (as in the last section) that convex combinations of distributions lead to optimal policies that are between the extremal policies.

The worst-case policy $p^\dagger(\pi,\epsilon)$ for an unknown private signal is the same as the optimal policy under a common prior without a private signal. The cost of informational robustness, i.e. $\nabla(\pi, 0)$, is the difference between the principal's value when the agent only works in the "moderate" state and the principal pays her cost of effort conditional on success (assuming the principal prefers this to shirking with zero transfers) and his value in the common prior game without a private signal. This game satisfies assumption ?? with $M_1 = O(\epsilon)$ and $M_2 = 0$.
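Analogously, a small sketch (ours, with hypothetical parameter names) of the $\epsilon$-robust limited-liability contract described above: pay nothing after failure and just enough after success that working beats shirking by $\epsilon$ in expectation, falling back to the zero contract when the cap $\bar p$ or the principal's participation condition fails. The exact participation check in the paper may differ slightly.

```python
def robust_contract(pi, c_work=1.0, c_shirk=0.0, b_success=2.0, p_bar=10.0, eps=0.1):
    """Sketch of the eps-robust contract with states {trivial, moderate, impossible}.

    pi: dict with keys 'trivial', 'moderate', 'impossible' summing to one.
    Pays 0 after failure and p_success = (c_work - c_shirk + eps) / pi['moderate']
    after success, so that working beats shirking by at least eps in expectation
    (work only changes the outcome in the 'moderate' state).  Falls back to the
    zero contract if the payment exceeds p_bar or the principal's expected payoff
    from a working agent is not positive.
    """
    if pi["moderate"] <= 0.0:
        return {"success": 0.0, "failure": 0.0}
    p_success = (c_work - c_shirk + eps) / pi["moderate"]
    prob_success_if_work = pi["trivial"] + pi["moderate"]      # work succeeds unless impossible
    principal_payoff = prob_success_if_work * (b_success - p_success)
    if p_success <= p_bar and principal_payoff > 0.0:
        return {"success": p_success, "failure": 0.0}
    return {"success": 0.0, "failure": 0.0}

if __name__ == "__main__":
    print(robust_contract({"trivial": 0.2, "moderate": 0.6, "impossible": 0.2}, eps=0.1))
    # pays roughly 1.83 after success; the extra eps / pi(moderate) is the cost of robustness
```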
B Agent's Learning Problem

Upper bounds on external regret are often viewed as compelling assumptions (e.g. Nekipelov et al. 2015, Braverman et al. 2018) because there exist relatively simple algorithms that guarantee vanishing ER as $T \to \infty$. For example, the exponential weights algorithm (a.k.a. hedge algorithm, exponentiated gradient algorithm) satisfies no-ER. In contrast, our behavioral assumptions – e.g. no-FCIR – may appear daunting, insofar as the agent must solve a learning problem with a context space that is exponential in the number of alternative mechanisms, $|\Sigma|$. When both the sequence of states $y^T$ and the learner $L$ are particularly pathological, no-FCIR may indeed be too strong an assumption. When the learner satisfies additional properties, or the sequence of states has some stochastic structure (e.g. is i.i.d. or Markov), no-FCIR may be more reasonable.

In this section, we make one simple observation. There exists a learner that guarantees no-FCIR (and hence no-CIR) for the agent under our mechanism from theorem 2, with the best rate of convergence we can hope for.

Definition 15 (CFL). Suppose the principal publicizes the forecast $\pi_t$ in every period $t$. The CFL sets
$$r_t \in \arg\max_{r\in\mathcal{R}} \; \mathbb{E}_{y\sim\pi_t}\big[U(r, p_t, y)\big]$$

(In our view, part of the principal's objective is to make the agent's problem as simple as possible. From a worst-case perspective, there is no benefit to hiding this information. With that said, we see no reason why this result should not apply under the weaker assumption that the principal's forecasting algorithm is public knowledge.)
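A minimal sketch of the CFL's period-$t$ choice (ours, with hypothetical array names): given the published forecast $\pi_t$ and the posted policy $p_t$, the agent simply maximizes forecast-expected utility.

```python
import numpy as np

def cfl_response(U, pi_t):
    """CFL in period t: pick the response maximizing expected utility under the
    principal's published forecast pi_t.

    U    : array of shape (R, Y) -- agent's utility U(r, p_t, y) for the posted policy p_t
    pi_t : array of shape (Y,)   -- the principal's published forecast of y_t
    """
    return int(np.argmax(U @ pi_t))

if __name__ == "__main__":
    U = np.array([[1.0, 0.0],    # response 0 pays off in state 0
                  [0.2, 0.7]])   # response 1 hedges toward state 1
    print(cfl_response(U, np.array([0.8, 0.2])))   # -> 0
    print(cfl_response(U, np.array([0.3, 0.7])))   # -> 1
```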
Proposition 3 verifies that the CFL satisfies the behavioral assumptions of theorem 2.

Proposition 3.
Let $\sigma^*$ be the mechanism from theorem 2. Then the CFL satisfies $\epsilon$-bounded FCIR (4) in expectation, i.e.
$$\mathbb{E}_{L,\sigma^*}[\mathrm{FCIR}] \le \epsilon = \tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|}\big) + O\big(\sqrt{|\mathcal{Y}|}\,\delta\big)$$
and $\tilde\epsilon$-lower-bounded FER (5), where $\tilde\epsilon = 0$. Moreover, if the agent uses the CFL, the principal's regret bound in theorem 2 applies regardless of whether alignment (9) holds.

Note that these rates preserve the $T^{-1/4}$ convergence rate (up to $\delta$ error) that is present in all of our mechanisms and reflects miscalibration of the principal. In that sense, the fact that the agent is also learning does not deteriorate the principal's performance at all. Although this has not been our emphasis so far, it would be interesting to see whether (or identify conditions under which) other simple learning algorithms satisfy our behavioral assumptions with decent rates of convergence.

C Calibrated Forecasting
In this appendix, we describe our forecasting algorithm and bound its miscalibration.

A linearly homogeneous, differentiable function $H$ is strongly convex with parameter $\xi$ if
$$H(\pi) \ge H(\tilde\pi) + \nabla H(\tilde\pi)\cdot(\pi - \tilde\pi) + \xi\,\|\pi - \tilde\pi\|_2^2$$
The gradient of $H$ describes a proper scoring rule $S(\pi) = \nabla H(\pi)$ where $H(\pi) = \pi\cdot S(\pi)$ (McCarthy 1956). A scoring rule $S : \Delta(\mathcal{Y}) \to \mathbb{R}^{\mathcal{Y}}$ is proper if the report $\tilde\pi$ that maximizes the $\pi$-expected score is the distribution $\pi$. Strong convexity of $H$ can be thought of as sharpening the incentives for truth-telling (Boutilier 2012).

Specifically, consider the quadratic scoring rule (see e.g. Jose et al. 2008)
$$S_y(\pi) = 2\pi(y) - \sum_{\tilde y\in\mathcal{Y}} \pi(\tilde y)^2$$
where $H(\pi) = \|\pi\|_2^2$ is strongly convex with $\xi = 2$.

Recall that the mechanism $\sigma^*$ is supposed to be nonresponsive. As a consequence, we cannot determine the principal's beliefs $\pi_t$ in a given period based on his historical payoffs. To ensure that the forecasts $\pi_t$ are well-calibrated, we consider an auxiliary online learning problem based on a scoring rule $S$. In period $t$, the principal makes a prediction $\pi_t$ with loss function $S_{y_t}(\pi_t)$. Specifically, the predictions come from the discretized set of priors, formed by choosing a representative element $\pi$ from each set in the partition $\mathcal{F}_S$. In terms of the score, this approximation has limited cost. Let $\pi = [\hat\pi_F]$ be the representative belief that is closest to the empirical distribution $\hat\pi_F$. Then
$$S_y(\hat\pi_F) - S_y(\pi) \;\le\; S_y(\hat\pi_F) - S_y(\hat\pi_F - \delta) \;=\; 2\hat\pi_F(y) - \sum_{\tilde y\in\mathcal{Y}}\hat\pi_F(\tilde y)^2 - 2\big(\hat\pi_F(y) - \delta\big) + \sum_{\tilde y\in\mathcal{Y}}\big(\hat\pi_F(\tilde y) - \delta\big)^2 \;\le\; 2\delta \tag{6}$$
where $\hat\pi_F - \delta$ is shorthand notation for the vector $(\hat\pi_F(y) - \delta)_{y\in\mathcal{Y}}$.

In this auxiliary problem, the exponential weights algorithm (see e.g. Cesa-Bianchi and Lugosi 2006) obtains expected external regret at most $\sqrt{T\log|\mathcal{F}_S|}$ relative to the best-in-hindsight $\pi^*_F$. A reduction due to Blum and Mansour (2007) (theorem 5) translates this into a bound on expected internal regret of $|\mathcal{F}_S|\sqrt{T\log|\mathcal{F}_S|}$ relative to the best-in-hindsight $\pi^*_F$. Combine this with the maximum approximation error (6) to bound the expected internal regret relative to the best-in-hindsight contextual belief $\pi \in \Delta(\mathcal{Y})$, which must be the empirical distribution $\hat\pi_F$ since $S$ is proper. Specifically,
$$|\mathcal{F}_S|\sqrt{T\log|\mathcal{F}_S|} + 2T\delta \;\ge\; \mathbb{E}\Big[\sum_{\pi} n_F\, \hat\pi_F\cdot\big(S(\hat\pi_F) - S(\pi)\big)\Big] \tag{7}$$
where $n_F$ is the number of periods $t$ with $\pi_t \in F$. This is a statement about the expected scoring loss, where the expectation reflects randomization in the algorithm. Our next result, lemma 3, translates this into a statement about the $\ell_1$ distance between the principal's belief $\pi$ and the empirical distribution $\hat\pi_F$.
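Before turning to lemma 3, here is a minimal sketch (ours) of the external-regret building block of the forecasting subroutine: exponential weights over a finite grid of forecasts, rewarded by the quadratic score. The paper additionally passes this through the Blum and Mansour (2007) external-to-internal reduction, which is not reproduced here; the grid construction and the learning rate are illustrative choices.

```python
import itertools
import numpy as np

def quadratic_score(pi, y):
    """S_y(pi) = 2*pi[y] - sum_yt pi[yt]^2 (the quadratic / Brier scoring rule)."""
    return 2.0 * pi[y] - float(np.sum(pi ** 2))

def forecast_grid(num_states, step=0.25):
    """A finite cover of the simplex Delta(Y): all grid points with the given step."""
    ticks = np.arange(0.0, 1.0 + 1e-9, step)
    pts = [p for p in itertools.product(ticks, repeat=num_states) if abs(sum(p) - 1.0) < 1e-9]
    return np.array(pts)

class ExpWeightsForecaster:
    """Exponential weights over discretized forecasts; the loss is the negated quadratic score."""
    def __init__(self, grid, eta=0.5):
        self.grid = grid
        self.eta = eta
        self.log_w = np.zeros(len(grid))

    def forecast(self, rng):
        w = np.exp(self.log_w - self.log_w.max())
        probs = w / w.sum()
        return self.grid[rng.choice(len(self.grid), p=probs)]

    def update(self, y):
        # Reward each grid forecast by its quadratic score on the realized state.
        scores = np.array([quadratic_score(pi, y) for pi in self.grid])
        self.log_w += self.eta * scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fc = ExpWeightsForecaster(forecast_grid(num_states=2, step=0.25))
    states = rng.choice(2, size=500, p=[0.7, 0.3])   # a stream unknown to the forecaster
    for y in states:
        pi_t = fc.forecast(rng)                       # published forecast for the period
        fc.update(y)
    print(np.round(fc.forecast(rng), 2))              # concentrates near (0.75, 0.25)
```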
Lemma 3. Let $S$ be a proper scoring rule where the optimal expected score $H$ is $\xi$-strongly convex. Then
$$\sqrt{\frac{|\mathcal{Y}|\,\kappa}{\xi}} \;\ge\; \frac{1}{T}\sum_{\pi} n_F\, d(\pi, \hat\pi_F) \qquad\text{where}\qquad \kappa = \frac{1}{T}\sum_{\pi} n_F\, \hat\pi_F\cdot\big(S(\hat\pi_F) - S(\pi)\big)$$

Proof.
Consider the principal's $\pi$-expected regret from predicting $\tilde\pi$:
$$\pi\cdot\big(S(\pi) - S(\tilde\pi)\big) = H(\pi) - \pi\cdot\nabla H(\tilde\pi) \;\ge\; H(\tilde\pi) - \nabla H(\tilde\pi)\cdot\tilde\pi + \xi\,\|\pi-\tilde\pi\|_2^2 \;=\; \xi\,\|\pi-\tilde\pi\|_2^2 \;\ge\; \xi\Big(\frac{\|\pi-\tilde\pi\|_1}{\sqrt{|\mathcal{Y}|}}\Big)^2 \;=\; \frac{\xi}{|\mathcal{Y}|}\,\|\pi-\tilde\pi\|_1^2$$
where the second-to-last step follows from $\|\cdot\|_1 \le \sqrt{|\mathcal{Y}|}\,\|\cdot\|_2$. It follows that his regret in the auxiliary problem satisfies
$$\kappa \;\ge\; \frac{1}{T}\sum_{\pi} n_F\, \frac{\xi}{|\mathcal{Y}|}\, d(\pi, \hat\pi_F)^2$$
where, implicitly, $F$ is the forecast context such that $\pi \in F$. Take the square root of both sides of this inequality:
$$\sqrt{\kappa} \;\ge\; \sqrt{\frac{1}{T}\sum_{\pi} n_F\, \frac{\xi}{|\mathcal{Y}|}\, d(\pi, \hat\pi_F)^2} \;\ge\; \frac{1}{\sqrt{T}}\cdot\frac{1}{\sqrt{T}}\sum_{\pi} n_F\, \sqrt{\frac{\xi}{|\mathcal{Y}|}}\, d(\pi, \hat\pi_F) \;=\; \sqrt{\frac{\xi}{|\mathcal{Y}|}}\cdot\frac{1}{T}\sum_{\pi} n_F\, d(\pi, \hat\pi_F)$$
where the first line is the $\ell_2$ norm of a vector with $T$ entries, the second line is the $\ell_1$ norm of that same vector, and the inequality follows from $\|\cdot\|_1 \le \sqrt{T}\,\|\cdot\|_2$. Collapse these inequalities and rearrange terms to obtain the desired result.

If we use the quadratic scoring rule, this lemma implies
$$\mathbb{E}\Big[\frac{1}{T}\sum_{\pi} n_F\, d(\pi, \hat\pi_F)\Big] \;\le\; \sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|\sqrt{\frac{\log|\mathcal{F}_S|}{T}} + 2\,|\mathcal{Y}|\,\delta} \tag{8}$$
To optimize this bound, up to log factors, set $\delta = (1/T)^{1/(|\mathcal{Y}|+2)}$, assuming $|\mathcal{F}_S| = O\big((1/\delta)^{|\mathcal{Y}|}\big)$.

D Generalized Results
Upper bounds on CIR constitute our rationality assumptions for the agent. However, our results also rely on informational assumptions. Sections 4, 5, and 6 consider environments that differ primarily by how "informed" the agent appears, relative to the principal. In all three cases, however, we require the agent to be at least as informed as the principal. What is the principal's information? Recall that our mechanisms $\sigma^*$ will be forecast mechanisms (??). A calibrated learning algorithm – which we specify later on – will produce a sequence of forecasts $\pi_1, \dots, \pi_T$. It is possible that these forecasts will become correlated with the state, e.g. if there is a trend in the data. We do not rule this out; however, if our forecasts inadvertently pick up useful information, this information should be available to the agent as well (either implicitly or because we publish $\pi_t$ along with $p_t$).

The notion of forecastwise regret (and forecastwise CIR) formalizes what we mean by the principal's "information" being available to the agent. The agent's benchmark includes the principal's forecast as additional context. Formally, define the forecast space $\mathcal{F} = \Delta(\mathcal{Y})$. Fix a small constant $\delta > 0$ and consider a finite partition $\mathcal{F}_S$ of $\mathcal{F}$ where $\pi, \tilde\pi \in F \in \mathcal{F}_S$ implies $d_\infty(\pi,\tilde\pi) \le \delta$. Let a discretized subset of $\mathcal{F}$ contain a single distribution $\pi \in F$ for every $F \in \mathcal{F}_S$.
Let the information partition combine the forecast and CIR context, i.e. $\mathcal{I} = \mathcal{F}_S \times (\cdots)^{\Sigma}$, and let the information $I_t \in \mathcal{I}$ in period $t$ be the unique set that satisfies $\big(\pi_t, r^*_t, (r^p_t)_{p\in\mathcal{P}}\big) \in I_t$.

Definition 17 (FCIR). The agent's forecastwise CIR relative to a modification rule $h : \mathcal{I} \to \mathcal{R}$ is
$$\mathrm{FCIR}(h) = \frac{1}{T}\sum_{t=1}^T \Big(U\big(h(I_t), p_t, y_t\big) - U\big(r_t, p_t, y_t\big)\Big)$$
The FCIR relative to the best-in-hindsight modification rule is
$\mathrm{FCIR} = \max_{h:\mathcal{I}\to\mathcal{R}} \mathrm{FCIR}(h)$.

To state our assumption, we need to define a forecastwise version of ER, just as we defined a forecastwise version of CIR at the end of section 3. Let the forecast context $F_t \in \mathcal{F}_S$ in period $t$ be the unique set that satisfies $\pi_t \in F_t$.

Definition 18 (FER). The agent's forecastwise external regret relative to a strategy $h : \mathcal{F}_S \to \mathcal{R}$ is
$$\mathrm{FER}(h) = \frac{1}{T}\sum_{t=1}^T \Big(U\big(h(F_t), p_t, y_t\big) - U\big(r_t, p_t, y_t\big)\Big)$$
The FER relative to the best-in-hindsight strategy is
$\mathrm{FER} = \max_{h:\mathcal{F}_S\to\mathcal{R}} \mathrm{FER}(h)$.

Theorem 4.
Assume regularity (1) and $\epsilon$-bounded FCIR (4). There exists a nonresponsive mechanism $\sigma^*$ parameterized by the agent's learner $L$ and a constant $\bar\epsilon > 0$ such that

1. The principal's regret is bounded, i.e.
$$\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{I\in\mathcal{I}} n_I\, \Delta(\hat\pi_I, \bar\epsilon) + \frac{1}{\bar\epsilon}\Big(O(\epsilon) + \tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|}\,|\mathcal{R}|^{(|\Sigma|+|\mathcal{F}_S|)/2}\big) + O(\delta) + O(\delta)\Big)$$

Assumption 9 (Alignment). The stage game is $(\epsilon, M_1, M_2)$-aligned if, for all signals $\gamma$,
$$\underbrace{\Big(\phi_{p^*(\pi,\epsilon)}(\pi,\epsilon) - \alpha_{p^*(\pi,\epsilon)}(\pi,\gamma,\epsilon)\Big)}_{\text{maximum downside of }\gamma\text{ for the principal}} \;\le\; M_1\,\underbrace{\max_{r,\,r_J\in\mathcal{R}} \mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[U(r_J, p^*(\pi,\epsilon), y)\big] - U(r, p^*(\pi,\epsilon), y)\Big]}_{\text{usefulness of }\gamma\text{ to the agent}} + M_2$$
and, for all policies $p \in \mathcal{P}$,
$$\underbrace{\Big(\beta_p(\pi,\gamma,\epsilon) - \phi_{p^*(\pi,\epsilon)}(\pi,\epsilon)\Big)}_{\text{maximum upside of }\gamma\text{ for the principal}} \;\le\; M_1\,\underbrace{\max_{r,\,r_J\in\mathcal{R}} \mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[U(r_J, p, y)\big] - U(r, p, y)\Big]}_{\text{usefulness of }\gamma\text{ to the agent}} + M_2$$

Theorem 5.
Assume regularity (1), $\epsilon$-bounded FCIR (4), $\tilde\epsilon$-lower-bounded FER (5), and $(\bar\epsilon, M_1, M_2)$-alignment (9). There exists a nonresponsive mechanism $\sigma^*$ parameterized by $\bar\epsilon$ such that

1. The principal's regret is bounded, i.e.
$$\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{F\in\mathcal{F}_S} n_F\, \Delta(\hat\pi_F, \bar\epsilon) + \frac{1}{\bar\epsilon}\Big(O(\epsilon) + \tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|}\big) + O(\sqrt{\delta}) + O(\delta)\Big) + O(\tilde\epsilon) + M_1\Big(O(\tilde\epsilon) + O(\epsilon) + \tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|}\big) + O(\sqrt{\delta}) + O(\delta)\Big) + O(M_2)$$

Theorem 6.
Assume regularity (1) and $\epsilon$-bounded FCIR (4). There exists a nonresponsive mechanism $\sigma^*$ parameterized by a constant $\bar\epsilon > 0$ such that

1. The principal's regret is bounded, i.e.
$$\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{F\in\mathcal{F}_S} n_F\, \Delta(\hat\pi_F, \bar\epsilon) + \frac{1}{\bar\epsilon}\Big(O(\epsilon) + \tilde O\big(T^{-1/4}\sqrt{|\mathcal{Y}|\,|\mathcal{F}_S|}\big) + O(\delta) + O(\delta)\Big)$$

E Omitted Proofs
E.1 Proof of Propositions ?? and ??
Recall that a policy $p_t$ in period $t$ can affect the agent's behavior $\mu_\tau$ in period $\tau > t$. This raises the prospect that a mistake today can cause irreversible damage to the principal's average utility. By definition, the principal will regret that mistake. This would make the principal's problem infeasible, in that he cannot guarantee low regret for himself.

Generally speaking, regret bounds can bypass this problem if they restrict how much the agent's response $r_t$ depends on the policy history $p^{t-1}$. This is reasonable a priori, since the policy history $p^{t-1}$ appears irrelevant to the agent's problem. It neither affects nor predicts the state $y_t$, except through its dependence on $y^{t-1}$. It is not needed as a predictor of the policy $p_t$ because the agent observes $p_t$ before choosing a response. For these reasons, it seems that the agent can only make herself worse off by allowing irrelevant variables like $p^{t-1}$ to affect her response $r_t$. This would be true if our notion of good performance overall had clear implications for behavior in each period, so that unnecessary variation in behavior implies a departure from optimality. Unfortunately, there are various kinds of behavior that obtain low ER. The agent can easily switch between these behaviors while still satisfying no-ER. In the process, she can cause substantial benefit or harm to the principal.

To clarify the problem, we present several examples of learners that we regard as pathological. These are implicit counterexamples to the proposition that no-ER constraints are sufficient for no-regret mechanism design.

Our counterexamples are closely related to the pathological phenomenon of "superefficiency" in statistics. Suppose we are trying to estimate the mean $\theta$ of the normal random variable $X \sim N(\theta, 1)$, given an i.i.d. random sample $X_1, \dots, X_n$. Our objective is to minimize the mean square error, but this depends on the parameter $\theta$. A typical solution is the maximum likelihood estimator (MLE), which in this case outputs the sample mean $n^{-1}\sum_{i=1}^n X_i$. For reasons that are unimportant to our discussion, MLE is considered "efficient". However, it is easy to find an estimator that outperforms MLE. For example, a wild-ass guess (WAG) ignores the data and outputs $\theta^*$. If it happens to be the case that $\theta = \theta^*$ then this estimator is optimal.

In the following example, we construct a learner that alternates between a WAG-like predictor and a MLE-like predictor depending on a seemingly irrelevant choice by the principal.

Example 3 (Selective Superefficiency). Consider a learner $L$ that is capable – either by ingenuity or dumb luck – of predicting the state sequence $y^T$ perfectly. However, the learner uses this ability only selectively, depending on the state $y_1$ and policy $p_1$ in the first period. Despite this seemingly irrational behavior, the learner satisfies vanishing external regret.

Let $P \subsetneq \mathcal{P}$ be a nonempty subset of policies. Let $Y \subsetneq \mathcal{Y}$ be a nonempty subset of states. Let $r^*$ be the best-in-hindsight response by time $T$. That is, consider some $r^* \in \mathcal{R}$ that happens to be best-in-hindsight given the realized state sequence $y^T$ but will not be best-in-hindsight uniformly over all state sequences. Given $y^T$, define the learner $L$ as follows:

1. If $y_1 \in Y$ and $p_1 \in P$ then use response $r^*$
2. If $y_1 \in Y$ and $p_1 \notin P$ then use the response that happens to be optimal given $y_t$.
3. If $y_1 \notin Y$ and $p_1 \in P$ then use the response that happens to be optimal given $y_t$.
4. If $y_1 \notin Y$ and $p_1 \notin P$ then use response $r^*$

(For example, if the agent were Bayesian then there would be no dependence on $p_t$ at all. If the agent used the exponential weights algorithm then there would only be an indirect dependence, since that algorithm depends on the agent's historical payoffs and these, in turn, depend on the policy history.)

In cases 1 and 4, the learner follows the best-in-hindsight response and therefore achieves zero regret. In cases 2 and 3, the learner acts optimally ex post and therefore achieves non-positive regret.

Nonetheless, no mechanism can guarantee no-regret for the principal. Suppose the mechanism chooses $p_1 \in P$. If it turns out that $y_1 \in Y$ then the agent will follow $r^*$. Otherwise, the agent will be superefficient. Were the mechanism to deviate to $p_1 \notin P$, the situation would be reversed. These constitute permanent changes in the agent's behavior. Suppose one type of behavior is "better" for the principal than another. It is always possible in hindsight that the mechanism's first-period policy was the one that led to the "worse" type of behavior.

Can further assumptions rule out this kind of behavior? Again, consider the analogy with statistics. The WAG estimator – always predict $\theta^*$ – will perform very poorly in the counterfactual world where $\theta^* \ne \theta$. Formally, this estimator is not "consistent". Similarly, the learner from example 3 does not guarantee vanishing average external regret under counterfactual state sequences. This reflects a peculiar unresponsiveness to the data.

Unfortunately, imposing consistency or no-regret on all sequences does not rule out these kinds of pathologies. Consider Hodges' (superefficient) estimator, which outputs $\theta^*$ unless there is sufficient evidence that $\theta \ne \theta^*$. In that case, it outputs the sample mean. If "sufficient evidence" is defined carefully, this estimator will outperform MLE when $\theta = \theta^*$ and will asymptotically match MLE otherwise. We can patch up example 3 in a similar way.

Example 4 (Selective Superefficiency, revised). We want to modify the learner in example 3 to ensure no-regret on all counterfactual state sequences $\tilde y^T$. This is straightforward. In period $t+1$, if the history $\tilde y^t$ matches the presumed sequence $y^t$ exactly, then proceed as before. Otherwise, follow any no-ER algorithm. This guarantees vanishing average regret as long as the regret from the first period $t$ where $\tilde y_t \ne y_t$ is bounded – which it is, since our utility functions are bounded. Therefore, for any sequence $y^T$ there is a learner that satisfies no-regret on all sequences, but exhibits the pathological behavior from example 3 on the realized sequence.

Statisticians deal with superefficiency by arguing that it generically fails to occur. That is, any alternative estimator will weakly underperform MLE on Lebesgue-almost all values $\theta$. For example, we can view Hodges' estimator as asymptotically equivalent to MLE whenever $\theta \ne \theta^*$. In our setting, attempting this argument would necessitate a definition of genericity for sequences of states. While we can provide various definitions, none seem especially compelling.

Another natural restriction to impose is that the learner should not outperform the experts.
That is, given the sequence $y^T$, we only consider learners whose regret at period $T$ is non-negative. Clearly, this rules out the learners in examples 3 and 4, which may obtain negative regret on the sequences where they predict the state perfectly and act on that information. Unfortunately, this does not rule out the broader phenomenon, as the following example illustrates.

(For example, one could assign equal measure to each sequence $y^T$ in the set $\mathcal{Y}^T$ of all possible sequences. By this measure, the measure of any constant sequence $(y, \dots, y)$ would converge to zero as $T \to \infty$. Yet it does not seem unreasonable a priori that the world should persist in a fixed state. Alternatively, one could assign equal measure to all permutations of a given sequence $y^T$. This would effectively return us to an i.i.d. setting.)

Example 5 (Selective Superinefficiency). As in example 3, define a learner $L$ that appears capable of predicting the state sequence $y^T$ perfectly. This learner will continue to use this ability selectively. Moreover, when the learner uses this ability, she does not always use it to her advantage. Instead, with probability $q$ she uses it to her own disadvantage. When $q$ is chosen correctly, the learner satisfies zero regret for all mechanisms.

Let $P \subsetneq \mathcal{P}$ be a nonempty subset of policies. Let $Y \subsetneq \mathcal{Y}$ be a nonempty subset of states. Let $r^*$ be the best-in-hindsight response by time $T$. Let $r^\dagger_t$ be the response that happens to be optimal for $y_t$. Let $\tilde r_t$ be the response that minimizes the agent's utility when the state is $y_t$. Given $y^T$, define the learner $L$ as follows:

1. If $y_1 \in Y$ and $p_1 \in P$ then use response $r^*$
2. If $y_1 \in Y$ and $p_1 \notin P$ then use $r^\dagger_t$ with probability $1-q$ and $\tilde r_t$ with probability $q$
3. If $y_1 \notin Y$ and $p_1 \in P$ then use $r^\dagger_t$ with probability $1-q$ and $\tilde r_t$ with probability $q$
4. If $y_1 \notin Y$ and $p_1 \notin P$ then use response $r^*$

In cases 1 and 4, the learner follows the best-in-hindsight response and therefore achieves zero regret. In cases 2 and 3, as long as $\tilde r_t$ underperforms $r^*$ in every period $t$, by continuity there exists a probability $q$ such that the agent achieves zero regret.

This learner now satisfies both an upper bound and a lower bound on regret. Nonetheless, our difficulties remain. Just as in example 3, the first-period policy can cause permanent changes in the agent's behavior. It is always possible in hindsight that the mechanism's first-period policy was the one that led to the "worse" type of behavior.

We can use this example to prove proposition ?? (proposition ?? is a corollary). In the Bayesian persuasion example, let $y^T$ be drawn i.i.d. where the defendant is guilty with probability $q = 0.5 - \epsilon$ for a very small $\epsilon > 0$. If the principal chooses $p_1$ correctly, he can persuade the agent to convict with probability near one. Otherwise, the agent convicts with probability near 0.5.

In the contract theory example, let $y^T$ be drawn i.i.d. from some distribution where the principal would find it optimal to pay the agent in the stage game, but both states occur with positive probability. If the principal chooses $p_1$ correctly, he can pay the agent her cost of effort and achieve the first-best outcome (the agent works iff working is effective). Otherwise, the principal has to compensate the agent for her cost of effort in states where working is ineffective.

The fundamental problem with the learners in examples 3, 4, and 5 is not that they are well-informed. After all, in some settings we might reasonably expect the agent to be better informed than the analyst. The problem is that they fail to consistently and fully exploit the private information that they clearly possess. Bounds on counterfactual internal regret capture this failure to exploit information and rule out these kinds of pathological behaviors.

Example 6.
Returning to example 3, consider two constant mechanisms $\sigma_p$ and $\sigma_{\tilde p}$ where $p \in P$ and $\tilde p \notin P$. Regardless of the state sequence $y^T$, exactly one of these mechanisms (say $\sigma_p$) will cause the agent to predict the state perfectly while the other will cause the agent to follow the best-in-hindsight response. The agent's behavior $r^p$ following $\sigma_p$ will differ across periods $t, \tau$ if and only if $y_t \ne y_\tau$, while the behavior $r^{\tilde p}$ following $\sigma_{\tilde p}$ remains constant throughout. The vector $(r^p, r^{\tilde p})$ will therefore differ across periods $t, \tau$ if and only if $y_t \ne y_\tau$.

If we require the agent to have no contextual regret where the context is $(r^p, r^{\tilde p})$, it is equivalent to requiring her to predict the state perfectly even if the principal uses $\sigma_{\tilde p}$. This is essentially the context used to define CIR. The learner guarantees no-CIR under mechanism $\sigma_p$, because it predicts the state perfectly. However, it does not predict the state perfectly under mechanism $\sigma_{\tilde p}$, so in this case the agent accumulates CIR.

E.2 Proof of Lemmas ?? and ?? and Additional Results
The lemmas in this section will be used repeatedly in the proofs of theorems 1, 2, and 3.
E.2.1 Proof of Lemmas ?? and ??
Lemma ?? states that for any policy $p$, information structure $\gamma$, constants $\epsilon, \tilde\epsilon > 0$, and distribution $\pi$, we have
\[
\alpha_p(\pi, \gamma, \epsilon + \tilde\epsilon) \ge \alpha_p(\pi, \gamma, \epsilon) - \frac{\tilde\epsilon}{\epsilon}
\qquad\text{and}\qquad
\beta_p(\pi, \gamma, \epsilon + \tilde\epsilon) \le \beta_p(\pi, \gamma, \epsilon) + \frac{\tilde\epsilon}{\epsilon}.
\]
Note that lemma ?? is just the special case where the information structure $\gamma$ is uninformative. To prove this, define
\[
B(\pi, \gamma, \epsilon) = \Big\{ \mu_J \in \Delta(\mathcal{R}) \;\Big|\; \epsilon \ge \max_{\tilde r_J}\, \mathbb{E}_{y\sim\pi}\Big[ \mathbb{E}_{J\sim\gamma(\cdot,y)}\big[ U(\tilde r_J, p, y) - \mathbb{E}_{r\sim\mu_J}[U(r,p,y)] \big] \Big] \Big\}
\]
and recall that
\[
\alpha_p(\pi,\gamma,\epsilon) = \min_{\mu_J \in B(\pi,\gamma,\epsilon)} \mathbb{E}_{y\sim\pi}\Big[ \mathbb{E}_{J\sim\gamma(\cdot,y)}\big[ \mathbb{E}_{r\sim\mu_J}[V(r,p,y)] \big] \Big].
\]
Note that $\alpha_p(\pi,\gamma,\epsilon)$ is decreasing and convex in $\epsilon$. Convexity follows from the fact that $\mu_J \in B(\pi,\gamma,\epsilon)$ and $\tilde\mu_J \in B(\pi,\gamma,\tilde\epsilon)$ imply $\lambda\mu_J + (1-\lambda)\tilde\mu_J \in B(\pi,\gamma,\lambda\epsilon + (1-\lambda)\tilde\epsilon)$. Therefore,
\[
\alpha_p(\pi,\gamma,\lambda\epsilon + (1-\lambda)\tilde\epsilon) \le \lambda\,\alpha_p(\pi,\gamma,\epsilon) + (1-\lambda)\,\alpha_p(\pi,\gamma,\tilde\epsilon).
\]
Consider any supporting line of $\alpha_p$ at $\epsilon$. It is bounded above by $\alpha_p$, by definition. Therefore, the magnitude of its slope is at most
\[
\frac{\alpha_p(\pi,\gamma,0) - \alpha_p(\pi,\gamma,\epsilon)}{\epsilon} \le \frac{1}{\epsilon},
\]
since $\alpha_p$ is bounded in the unit interval by our regularity assumption. Therefore, at $\epsilon + \tilde\epsilon$ the supporting line lies at most $\tilde\epsilon/\epsilon$ below $\alpha_p(\pi,\gamma,\epsilon)$, and $\alpha_p(\pi,\gamma,\epsilon+\tilde\epsilon)$ lies weakly above the supporting line. This implies our bound. The argument for $\beta_p$ is analogous after we observe that it is increasing and concave in $\epsilon$.
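For intuition, when the response and state sets are finite and $\gamma$ is uninformative, $\alpha_p(\pi,\epsilon)$ is a small linear program, and the sensitivity bound above can be spot-checked numerically. The sketch below is ours, not the paper's: the helper name `alpha_p`, the use of `scipy`, and the random payoff matrices are illustrative assumptions; payoffs are drawn in $[0,1]$ to match the regularity assumption.

```python
# Minimal sketch: alpha_p(pi, eps) as a linear program over eps-best responses.
import numpy as np
from scipy.optimize import linprog

def alpha_p(U, V, pi, eps):
    """Worst-case principal payoff over the agent's eps-best-response distributions.

    U, V : arrays of shape (n_responses, n_states), agent / principal payoffs
           under the fixed policy p.
    pi   : array of shape (n_states,), a candidate state distribution.
    eps  : the agent's allowed regret in the stage game.
    """
    u = U @ pi                      # expected agent payoff of each pure response
    v = V @ pi                      # expected principal payoff of each pure response
    m = len(u)
    # minimise v . mu  subject to  u . mu >= max(u) - eps  and  mu in the simplex
    res = linprog(c=v,
                  A_ub=[-u], b_ub=[-(u.max() - eps)],
                  A_eq=[np.ones(m)], b_eq=[1.0],
                  bounds=[(0.0, None)] * m, method="highs")
    return res.fun

# toy check of alpha_p(pi, eps + de) >= alpha_p(pi, eps) - de / eps
rng = np.random.default_rng(0)
U, V = rng.random((4, 3)), rng.random((4, 3))
pi = np.array([0.5, 0.3, 0.2])
eps, de = 0.1, 0.05
assert alpha_p(U, V, pi, eps + de) >= alpha_p(U, V, pi, eps) - de / eps - 1e-9
```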
E.2.2 Bounds for Misspecified Distributions

The following lemma states that the principal's worst-case utility $\alpha_p$ is not too sensitive to changes in the distribution, for any fixed policy $p$.

Lemma 4. For any policy $p$, information structure $\gamma$, constant $\epsilon > 0$, and distributions $\pi, \tilde\pi$, we have
\[
\alpha_p(\pi,\gamma,\epsilon) \ge \alpha_p(\tilde\pi,\gamma,\epsilon) - \frac{2\,d(\pi,\tilde\pi)}{\epsilon} - d(\pi,\tilde\pi).
\]
Proof.
Note that $B(\pi,\gamma,\epsilon) \subseteq B(\tilde\pi,\gamma,\epsilon + 2d(\pi,\tilde\pi))$ since (1) for any $\tilde r_J$,
\[
\mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[U(\tilde r_J,p,y)\big]\Big] \ge \mathbb{E}_{y\sim\tilde\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[U(\tilde r_J,p,y)\big]\Big] - d(\pi,\tilde\pi)
\]
and (2) for any $\mu_J$,
\[
\mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[\mathbb{E}_{r\sim\mu_J}[U(r,p,y)]\big]\Big] \le \mathbb{E}_{y\sim\tilde\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[\mathbb{E}_{r\sim\mu_J}[U(r,p,y)]\big]\Big] + d(\pi,\tilde\pi).
\]
Therefore,
\[
\begin{aligned}
\alpha_p(\pi,\gamma,\epsilon)
&\ge \min_{\mu_J \in B(\tilde\pi,\gamma,\epsilon+2d(\pi,\tilde\pi))} \mathbb{E}_{y\sim\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[\mathbb{E}_{r\sim\mu_J}[V(r,p,y)]\big]\Big] \\
&\ge \min_{\mu_J \in B(\tilde\pi,\gamma,\epsilon+2d(\pi,\tilde\pi))} \mathbb{E}_{y\sim\tilde\pi}\Big[\mathbb{E}_{J\sim\gamma(\cdot,y)}\big[\mathbb{E}_{r\sim\mu_J}[V(r,p,y)]\big]\Big] - d(\pi,\tilde\pi) \\
&= \alpha_p(\tilde\pi,\gamma,\epsilon+2d(\pi,\tilde\pi)) - d(\pi,\tilde\pi) \\
&\ge \alpha_p(\tilde\pi,\gamma,\epsilon) - \frac{2\,d(\pi,\tilde\pi)}{\epsilon} - d(\pi,\tilde\pi).
\end{aligned}
\]
The following lemma states that the $\epsilon$-robust policy for a distribution $\tilde\pi$ that is near the true distribution $\pi$ will perform almost as well as the $\epsilon$-robust policy for the true distribution $\pi$.

Lemma 5.
For any $\epsilon > 0$ and distributions $\pi, \tilde\pi$, we have
\[
\alpha_{p^*(\tilde\pi,\epsilon)}(\pi,\epsilon) \ge \alpha_{p^*(\pi,\epsilon)}(\pi,\epsilon) - \frac{4\,d(\pi,\tilde\pi)}{\epsilon} - 2\,d(\pi,\tilde\pi).
\]
Proof.
First, observe that
\[
\alpha_{p^*(\pi,\epsilon)}(\pi,\epsilon) \le \alpha_{p^*(\pi,\epsilon)}(\tilde\pi,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi) \le \alpha_{p^*(\tilde\pi,\epsilon)}(\tilde\pi,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi).
\]
Next, observe that
\[
\alpha_{p^*(\tilde\pi,\epsilon)}(\tilde\pi,\epsilon) \le \alpha_{p^*(\tilde\pi,\epsilon)}(\pi,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi).
\]
Combining these two inequalities yields the desired bound.

The following lemma states that the $\epsilon$-informationally-robust policy for a distribution $\tilde\pi$ that is near the true distribution $\pi$ will provide a similar guarantee against the worst-case information structure $\gamma$ as the $\epsilon$-informationally-robust policy for the true distribution $\pi$.

Lemma 6.
For any $\epsilon > 0$ and distributions $\pi, \tilde\pi$, we have
\[
\inf_\gamma \alpha_{p^\dagger(\pi,\epsilon)}(\pi,\gamma,\epsilon) \le \inf_\gamma \alpha_{p^\dagger(\tilde\pi,\epsilon)}(\pi,\gamma,\epsilon) + \frac{4\,d(\pi,\tilde\pi)}{\epsilon} + 2\,d(\pi,\tilde\pi).
\]
Proof.
First, observe that
\[
\alpha_{p^\dagger(\pi,\epsilon)}(\pi,\gamma,\epsilon) \le \alpha_{p^\dagger(\pi,\epsilon)}(\tilde\pi,\gamma,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi),
\]
which implies
\[
\inf_\gamma \alpha_{p^\dagger(\pi,\epsilon)}(\pi,\gamma,\epsilon) \le \inf_\gamma \alpha_{p^\dagger(\pi,\epsilon)}(\tilde\pi,\gamma,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi) \le \inf_\gamma \alpha_{p^\dagger(\tilde\pi,\epsilon)}(\tilde\pi,\gamma,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi).
\]
Next, observe that
\[
\alpha_{p^\dagger(\tilde\pi,\epsilon)}(\tilde\pi,\gamma,\epsilon) \le \alpha_{p^\dagger(\tilde\pi,\epsilon)}(\pi,\gamma,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi),
\]
which implies
\[
\inf_\gamma \alpha_{p^\dagger(\tilde\pi,\epsilon)}(\tilde\pi,\gamma,\epsilon) \le \inf_\gamma \alpha_{p^\dagger(\tilde\pi,\epsilon)}(\pi,\gamma,\epsilon) + \frac{2\,d(\pi,\tilde\pi)}{\epsilon} + d(\pi,\tilde\pi).
\]
Collapse these inequalities to obtain the desired result.
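Continuing the illustrative sketch above (the hypothetical `alpha_p` helper), with a finite policy set the $\epsilon$-robust policy $p^*(\pi,\epsilon)$ is just an argmax over per-policy programs, so the bound in lemma 5 can be spot-checked numerically. The helper names below and the choice of the $\ell_1$ distance for $d(\cdot,\cdot)$ are assumptions made for illustration only.

```python
# Hypothetical continuation of the alpha_p sketch: robust policies and lemma 5.
import numpy as np

def robust_policy(payoffs, pi, eps):
    """argmax_p alpha_p(pi, eps) over a finite list of (U_p, V_p) payoff pairs."""
    vals = [alpha_p(U_p, V_p, pi, eps) for (U_p, V_p) in payoffs]
    return int(np.argmax(vals)), max(vals)

rng = np.random.default_rng(1)
payoffs = [(rng.random((4, 3)), rng.random((4, 3))) for _ in range(5)]
pi_true = np.array([0.6, 0.3, 0.1])        # "true" distribution
pi_mis  = np.array([0.5, 0.35, 0.15])      # misspecified forecast
eps = 0.1
d = np.abs(pi_true - pi_mis).sum()         # illustrative choice: l1 distance
p_true, _ = robust_policy(payoffs, pi_true, eps)
p_mis,  _ = robust_policy(payoffs, pi_mis, eps)
lhs = alpha_p(*payoffs[p_mis],  pi_true, eps)
rhs = alpha_p(*payoffs[p_true], pi_true, eps) - 4 * d / eps - 2 * d
assert lhs >= rhs - 1e-9                   # lemma 5, with the uninformative gamma
```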
E.3 Proof of Theorem 4
Assume access to a forecast $\pi_t$ (a distribution over states) for every period $t$. We will define this later. In period $t$, the mechanism computes the policy $p^*(\pi_t,\bar\epsilon)$ that maximizes the worst-case payoff in the $\bar\epsilon$-robust stage game, treating the forecast $\pi_t$ as the common prior. That is,
\[
p^*(\pi_t,\bar\epsilon) \in \arg\max_{p} \alpha_p(\pi_t,\bar\epsilon).
\]
The mechanism chooses $p_t$ as follows. Let $P$ be the unique policy context that includes the policy $p^*(\pi_t,\bar\epsilon) \in P$. Let $p_t = p_P$, where $p_P$ is the representative element of $P$. (A schematic sketch of this per-period choice is given below.)

We will refer to the average regret accumulated in each forecast context $F$, i.e.
\[
\epsilon_F = \max_{r} \frac{1}{n_F}\sum_{t\in F}\big( U(r, p_F, y_t) - U(r_t, p_F, y_t) \big),
\]
where $p_F = p_P$ for the unique policy context $P$ associated with forecast context $F$. We will also refer to the average regret accumulated in each information context $I$, i.e.
\[
\epsilon_I = \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p_I, y_t) - U(r_t, p_I, y_t) \big),
\]
where $p_I = p_F$ for the unique forecast context $F$ associated with information $I$. Note that
\[
\epsilon_F = \frac{1}{n_F}\sum_{I\in F} n_I\,\epsilon_I = \frac{1}{n_F}\sum_{I\in F} \max_{r} \sum_{t\in I}\big( U(r, p_F, y_t) - U(r_t, p_F, y_t) \big).
\]
The next two lemmas imply an upper bound on the principal's regret in terms of the quantity
\[
\iota \ge \frac{1}{T}\sum_{t=1}^{T} d(\pi_t, \hat\pi_{I_t})
\]
that measures the discrepancy between the forecast $\pi_t$ and the empirical distribution $\hat\pi_I$ conditioned on the agent's information $I$. Lemma 7 is a lower bound on the principal's payoff under $\sigma^*$. Lemma 8 is an upper bound on his payoff under any constant $\sigma_p \in \Sigma$.
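Schematically, the per-period choice just described can be written as the following loop. All interfaces here are hypothetical stand-ins (the forecaster, the environment, the policy-context map, and the representative-element table correspond to objects defined earlier in the paper), and `robust_policy` is the illustrative helper sketched above; this is a sketch of the construction in this proof, not the authors' implementation.

```python
# Schematic, hypothetical sketch of the mechanism sigma* analysed in theorem 4.
def run_mechanism(T, payoffs, eps_bar, forecaster, policy_context, representative, environment):
    for t in range(T):
        pi_t = forecaster.predict()                        # forecast pi_t for period t
        p_star, _ = robust_policy(payoffs, pi_t, eps_bar)  # eps_bar-robust policy for pi_t
        p_t = representative[policy_context(p_star)]       # representative p_P of its context P
        y_t = environment.play(p_t)                        # agent responds; state y_t is revealed
        forecaster.update(y_t)                             # update the forecasting procedure
```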
Lemma 7. Suppose the principal runs $\sigma^*$. Then
\[
\frac{1}{T}\sum_{I}\sum_{t\in I} V(r_t, p_t, y_t) \ge \frac{1}{T}\sum_{I} n_I \max_{p} \alpha_p(\hat\pi_I, \bar\epsilon) - \frac{\epsilon + 4\iota + K_U\delta + 2K_U\delta}{\bar\epsilon} - \big( \iota + K_V\delta + K_V\delta \big).
\]
Proof.
Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under mechanism $\sigma^*$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p_I, y_t) - U(r_I, p_I, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p_I, y) - U(r_I, p_I, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\alpha$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p_I, y_t) \ge \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p_I, y) \big] - K_V\delta \ge \alpha_{p_I}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta.
\]
Summing over information contexts $I$ and using lemma 1, we obtain
\[
\begin{aligned}
\frac{1}{T}\sum_{I}\sum_{t\in I} V(r_t, p_I, y_t)
&\ge \frac{1}{T}\sum_{I} n_I\, \alpha_{p_I}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta \\
&\ge \frac{1}{T}\sum_{I}\sum_{t\in I} \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \epsilon_I + K_U\delta + 2K_U\delta) - K_V\delta - K_V\delta \\
&\ge \frac{1}{T}\sum_{I}\sum_{t\in I} \Big( \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_I + K_U\delta + 2K_U\delta}{\bar\epsilon} \Big) - K_V\delta - K_V\delta \\
&\ge \frac{1}{T}\sum_{I}\sum_{t\in I} \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon + K_U\delta + 2K_U\delta}{\bar\epsilon} - K_V\delta - K_V\delta.
\end{aligned}
\]
Focus on the first term, i.e.
\[
\frac{1}{T}\sum_{I}\sum_{t\in I} \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon)
\ge \frac{1}{T}\sum_{I}\sum_{t\in I} \Big( \alpha_{p^*(\hat\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{4\,d(\pi_t,\hat\pi_I)}{\bar\epsilon} - 2\,d(\pi_t,\hat\pi_I) \Big)
\ge \frac{1}{T}\sum_{I} n_I\, \alpha_{p^*(\hat\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{4\iota}{\bar\epsilon} - 2\iota.
\]
Collapsing these inequalities gives us the desired bound.
Lemma 8.
Suppose the principal runs some constant mechanism $\sigma_p \in \Sigma$. Then
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p, y_t) \le \frac{1}{T}\sum_{I} n_I \max_{\tilde p} \beta_{\tilde p}(\hat\pi_I, \bar\epsilon) + \frac{\epsilon + K_U\delta}{\bar\epsilon} + K_V\delta.
\]
Proof.
Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under $\sigma_p$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p_I, y_t) - U(r_I, p_I, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p_I, y) - U(r_I, p_I, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\beta$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p, y_t) \le \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p, y) \big] + K_V\delta \le \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta.
\]
Summing over information contexts $I$ and using lemma 1, we obtain
\[
\begin{aligned}
\frac{1}{T}\sum_{I}\sum_{t\in I} V(r_t, p, y_t)
&\le \frac{1}{T}\sum_{I} n_I\, \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta \\
&\le \frac{1}{T}\sum_{I} n_I \Big( \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_I + K_U\delta}{\bar\epsilon} \Big) + K_V\delta \\
&\le \frac{1}{T}\sum_{I} n_I\, \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon + K_U\delta}{\bar\epsilon} + K_V\delta \\
&\le \frac{1}{T}\sum_{I} n_I \max_{\tilde p} \beta_{\tilde p}(\hat\pi_I, \bar\epsilon) + \frac{\epsilon + K_U\delta}{\bar\epsilon} + K_V\delta.
\end{aligned}
\]
This is the desired bound.

From these two lemmas, it immediately follows that
\[
\mathrm{PR} \le \frac{1}{T}\sum_{I} n_I\, \Delta(\hat\pi_I, \bar\epsilon) + \frac{2\big( \epsilon + 2\iota + K_U\delta + K_U\delta \big)}{\bar\epsilon} + \big( \iota + 2K_V\delta + K_V\delta \big).
\]
Therefore, to bound the principal's regret, all that remains is to bound $\iota$.

Let $\mathcal{C}$ denote the set of contexts $C_t$, where $C_t$ is the vector describing the agent's response contexts $R_t$ under policy history $p^*_{1:t-1}$ and policy choices $p_t$. Note that this is different from the behavior context that we used to define the agent's information, which refers to the response context $R_t$ under policy history $p^*_{1:t}$. Because we are currently designing the mechanism, we cannot refer to $p^*_t$ without attempting to solve a fixed point problem that may not have a solution.

We use the algorithm from appendix ?? to generate $\pi_t$, with a modification: run it separately for each context $C_t$. Adapting equation (8), we obtain
\[
\mathbb{E}_{\sigma^*}\Big[ \frac{1}{T}\sum_{C}\sum_{F} n_{F,C}\, d(\pi_t, \hat\pi_C) \Big]
\le \frac{1}{T}\sum_{C} n_C \sqrt{\frac{|\cdot|\,|\cdot|}{n_C}} + 2\,|\cdot|\,\delta
\le \sqrt{\frac{|\mathcal{C}|\,|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta,
\]
where $n_{F,C}$ is the number of periods $t$ where $C_t = C$ and $\pi_t \in F$, the second step uses $\sum_C \sqrt{n_C} \le \sqrt{|\mathcal{C}|\,T}$, and the omitted cardinalities are as in equation (8).

Consider any two periods $t, \tau$ where $I_t = I_\tau$ but $C_t \ne C_\tau$. Since $I_t = I_\tau$ and information includes the forecast as context, we know that $\pi_t = \pi_\tau$. Now, consider
\[
n_{F_t,C_t}\, d(\pi_t, \hat\pi_{C_t}) + n_{F_\tau,C_\tau}\, d(\pi_\tau, \hat\pi_{C_\tau})
= n_{F_t,C_t}\, d(\pi_t, \hat\pi_{C_t}) + n_{F_t,C_\tau}\, d(\pi_t, \hat\pi_{C_\tau})
\ge \big( n_{F_t,C_t} + n_{F_t,C_\tau} \big)\, d\!\left( \pi_t,\; \frac{n_{F_t,C_t}\,\hat\pi_{C_t} + n_{F_t,C_\tau}\,\hat\pi_{C_\tau}}{n_{F_t,C_t} + n_{F_t,C_\tau}} \right)
\]
by subadditivity and homogeneity of norms. By continuing this process of combining contexts, we find
\[
\frac{1}{T}\sum_{C}\sum_{F} n_{F,C}\, d(\pi_t, \hat\pi_C) \ge \frac{1}{T}\sum_{I} n_I\, d(\pi_t, \hat\pi_I) = \iota.
\]
Therefore, the earlier miscalibration bound applies to $\mathbb{E}_{\sigma^*}[\iota]$ as well. Finally, we obtain our bound on the principal's expected regret:
\[
\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{I} n_I\, \Delta(\hat\pi_I, \bar\epsilon)
+ \frac{2\Big( \epsilon + 2\sqrt{\tfrac{|\mathcal{C}|\,|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta + K_U\delta + K_U\delta \Big)}{\bar\epsilon}
+ \Big( \sqrt{\tfrac{|\mathcal{C}|\,|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta + 2K_V\delta + K_V\delta \Big).
\]
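The forecasts $\pi_t$ above come from the algorithm of appendix ??, run separately within each context $C$. As a stand-in only (it does not implement that algorithm and carries no miscalibration guarantee), the following hypothetical per-context empirical-frequency forecaster illustrates the interface such a procedure would expose.

```python
# Hypothetical stand-in: a per-context empirical-frequency forecaster.
import numpy as np
from collections import defaultdict

class PerContextForecaster:
    def __init__(self, n_states):
        # Laplace-smoothed state counts, kept separately for each context C.
        self.counts = defaultdict(lambda: np.ones(n_states))

    def predict(self, context):
        c = self.counts[context]
        return c / c.sum()              # forecast pi_t given the current context

    def update(self, context, y):
        self.counts[context][y] += 1.0  # record the realized state for that context

# usage: f = PerContextForecaster(n_states=3); pi_t = f.predict(C_t); f.update(C_t, y_t)
```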
E.4 Proof of Theorem 5

Define
\[
\hat\gamma_F(I, y) = \frac{n_I\, \hat\pi_I(y)}{n_F\, \hat\pi_F(y)} \cdot \mathbf{1}\{ I \in F \}
\]
as the empirical information structure conditional on forecast context $F$. This definition follows from Bayes' rule.

Assume access to a forecast $\pi_t$ (a distribution over states) for every period $t$. We will define this later. In period $t$, the mechanism computes the policy $p^*(\pi_t,\bar\epsilon)$ that maximizes the worst-case payoff in the $\bar\epsilon$-robust stage game, treating the forecast $\pi_t$ as the common prior. That is,
\[
p^*(\pi_t,\bar\epsilon) \in \arg\max_{p} \alpha_p(\pi_t,\bar\epsilon).
\]
The mechanism chooses $p_t$ as follows. Let $P$ be the unique policy context that includes the policy $p^*(\pi_t,\bar\epsilon) \in P$. Let $p_t = p_P$, where $p_P$ is the representative element of $P$.

The next two lemmas imply an upper bound on the principal's regret in terms of the quantity
\[
\iota \ge \frac{1}{T}\sum_{t=1}^{T} d(\pi_t, \hat\pi_{F_t})
\]
that measures the discrepancy between the forecast $\pi_t$ and the empirical distribution $\hat\pi_F$ conditioned on the forecast context $F$. Lemma 9 is a lower bound on the principal's payoff under $\sigma^*$. Lemma 10 is an upper bound on his payoff under any constant $\sigma_p \in \Sigma$.

Lemma 9.
Suppose the principal runs the mechanism $\sigma^*$. Then
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p_t, y_t) \ge \frac{1}{T}\sum_{F} n_F \max_{p} \alpha_p(\hat\pi_F, \bar\epsilon)
- \frac{\epsilon + 6\iota + K_U\delta + 2K_U\delta}{\bar\epsilon}
- \Big( M\big( \epsilon + \tilde\epsilon + 2\iota + 2K_U\delta \big) + M + 3\iota + K_V\delta + K_V\delta \Big).
\]
Proof.
Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under mechanism $\sigma^*$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p_I, y_t) - U(r_I, p_I, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p_I, y) - U(r_I, p_I, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\alpha$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p_I, y_t) \ge \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p_I, y) \big] - K_V\delta \ge \alpha_{p_I}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta.
\]
Summing over information $I \in F$ and using lemma 1, we obtain
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p_F, y_t)
&\ge \frac{1}{n_F}\sum_{I\in F} n_I\, \alpha_{p_F}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta \\
&\ge \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \epsilon_I + K_U\delta + 2K_U\delta) - K_V\delta - K_V\delta \\
&\ge \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \Big( \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_I + K_U\delta + 2K_U\delta}{\bar\epsilon} \Big) - K_V\delta - K_V\delta \\
&= \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \alpha_{p^*(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_F + K_U\delta + 2K_U\delta}{\bar\epsilon} - K_V\delta - K_V\delta.
\end{aligned}
\]
So far, we have a lower bound for the principal's payoff that nearly matches the principal's worst-case payoff in the stage game if the agent had information structure $\hat\gamma_F$ in each forecast context. Furthermore, we know that this information structure cannot be particularly useful to the agent. Define
\[
-\tilde\epsilon_F = \max_{r} \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I}\big( U(r, p_F, y_t) - U(r_t, p_F, y_t) \big)
\]
as the (possibly negative) regret accumulated in forecast context $F$ relative to the best-in-hindsight response, rather than the best-in-hindsight function from information to responses. Let $\pi_F$ be the forecast associated with forecast context $F$. Note that
\[
\begin{aligned}
\epsilon_F + \tilde\epsilon_F
&= \min_{\tilde r} \frac{1}{n_F}\sum_{I\in F} \max_{r} \sum_{t\in I}\big( U(r, p_F, y_t) - U(\tilde r, p_F, y_t) \big) \\
&\ge \min_{\tilde r} \frac{1}{n_F}\sum_{I\in F} \max_{r} \sum_{t\in I}\big( U(r, p^*(\pi_F,\bar\epsilon), y_t) - U(\tilde r, p^*(\pi_F,\bar\epsilon), y_t) \big) - 2K_U\delta \\
&= \min_{r}\max_{r_J}\, \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ U(r_J, p^*(\pi_F,\bar\epsilon), y) - U(r, p^*(\pi_F,\bar\epsilon), y) \big] \Big] - 2K_U\delta \\
&\ge \min_{r}\max_{r_J}\, \mathbb{E}_{y\sim\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ U(r_J, p^*(\pi_F,\bar\epsilon), y) - U(r, p^*(\pi_F,\bar\epsilon), y) \big] \Big] - 2\,d(\pi_F,\hat\pi_F) - 2K_U\delta.
\end{aligned}
\]
It follows from assumption 9 that
\[
M\big( \epsilon_F + \tilde\epsilon_F + 2\,d(\pi_F,\hat\pi_F) + 2K_U\delta \big) + M \ge \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \bar\epsilon) - \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \hat\gamma_F, \bar\epsilon),
\]
which can be rewritten as
\[
\alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \hat\gamma_F, \bar\epsilon) \ge \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \bar\epsilon) - \Big( M\big( \epsilon_F + \tilde\epsilon_F + 2\,d(\pi_F,\hat\pi_F) + 2K_U\delta \big) + M \Big). \tag{9}
\]
Next, we relate our lower bound on the principal's payoff to the term $\alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \hat\gamma_F, \bar\epsilon)$. Note that
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F} n_I\, \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_I, \bar\epsilon)
&= \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_{F\mid J}, \bar\epsilon) \big] \Big] \\
&\ge \min_{e_J}\, \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_{F\mid J}, e_J) \big] \Big] \quad \text{s.t. } \bar\epsilon = \mathbb{E}_{y\sim\hat\pi_F}\big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}[e_J] \big] \\
&= \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_F, \hat\gamma_F, \bar\epsilon) \\
&\ge \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \hat\gamma_F, \bar\epsilon) - \frac{2\,d(\pi_F,\hat\pi_F)}{\bar\epsilon} - d(\pi_F,\hat\pi_F).
\end{aligned}
\]
Combining this with inequality (9) gives
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F} n_I\, \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_I, \bar\epsilon)
&\ge \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_F, \bar\epsilon) - \Big( M\big( \epsilon_F + \tilde\epsilon_F + 2\,d(\pi_F,\hat\pi_F) + 2K_U\delta \big) + M \Big) - \frac{2\,d(\pi_F,\hat\pi_F)}{\bar\epsilon} - d(\pi_F,\hat\pi_F) \\
&\ge \alpha_{p^*(\hat\pi_F,\bar\epsilon)}(\hat\pi_F, \bar\epsilon) - \Big( M\big( \epsilon_F + \tilde\epsilon_F + 2\,d(\pi_F,\hat\pi_F) + 2K_U\delta \big) + M \Big) - \frac{6\,d(\pi_F,\hat\pi_F)}{\bar\epsilon} - 3\,d(\pi_F,\hat\pi_F).
\end{aligned}
\]
Combining this with the earlier lower bound on the principal's payoff yields
\[
\frac{1}{n_F}\sum_{t\in F} V(r_t, p_F, y_t) \ge \alpha_{p^*(\hat\pi_F,\bar\epsilon)}(\hat\pi_F, \bar\epsilon)
- \frac{\epsilon_F + 6\,d(\pi_F,\hat\pi_F) + K_U\delta + 2K_U\delta}{\bar\epsilon}
- \Big( M\big( \epsilon_F + \tilde\epsilon_F + 2\,d(\pi_F,\hat\pi_F) + 2K_U\delta \big) + M + 3\,d(\pi_F,\hat\pi_F) + K_V\delta + K_V\delta \Big).
\]
Summing over forecast contexts $F$ gives us the desired result.

Lemma 10.
Suppose the principal runs some constant mechanism $\sigma_p \in \Sigma$. Then
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p, y_t) \le \frac{1}{T}\sum_{F} n_F \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \bar\epsilon) + \frac{\epsilon + K_U\delta}{\bar\epsilon} + \Big( M(\epsilon + \tilde\epsilon) + M + K_V\delta \Big).
\]
Proof.
Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under mechanism $\sigma_p$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p, y_t) - U(r_I, p, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p, y) - U(r_I, p, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\beta$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p, y_t) \le \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p, y) \big] + K_V\delta \le \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta.
\]
Summing over information $I \in F$ and using lemma 1, we obtain
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p, y_t)
&\le \frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta \\
&\le \frac{1}{n_F}\sum_{I\in F} n_I \Big( \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_I + K_U\delta}{\bar\epsilon} \Big) + K_V\delta \\
&= \frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_F + K_U\delta}{\bar\epsilon} + K_V\delta.
\end{aligned}
\]
So far, we have an upper bound for the principal's payoff that nearly matches the principal's worst-case payoff in the stage game if the agent had information structure $\hat\gamma_F$ in each forecast context. Furthermore, we know that this information structure cannot be particularly useful to the agent. Define
\[
-\tilde\epsilon_F = \max_{r} \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I}\big( U(r, p, y_t) - U(r_t, p, y_t) \big)
\]
as the (possibly negative) regret accumulated in forecast context $F$ relative to the best-in-hindsight response, rather than the best-in-hindsight function from information to responses. Let $\pi_F$ be the forecast associated with forecast context $F$. Note that
\[
\epsilon_F + \tilde\epsilon_F = \min_{\tilde r} \frac{1}{n_F}\sum_{I\in F} \max_{r} \sum_{t\in I}\big( U(r, p, y_t) - U(\tilde r, p, y_t) \big).
\]
It follows from assumption 9 that
\[
M\big( \epsilon_F + \tilde\epsilon_F \big) + M \ge \beta_p(\hat\pi_F, \hat\gamma_F, \bar\epsilon) - \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \bar\epsilon),
\]
which can be rewritten as
\[
\beta_p(\hat\pi_F, \hat\gamma_F, \bar\epsilon) \le \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \bar\epsilon) + M\big( \epsilon_F + \tilde\epsilon_F \big) + M.
\]
Next, we relate our upper bound on the principal's payoff to the term $\beta_p(\hat\pi_F, \hat\gamma_F, \bar\epsilon)$. Note that
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \bar\epsilon)
&= \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \beta_p(\hat\pi_{F\mid J}, \bar\epsilon) \big] \Big] \\
&\le \max_{e_J}\, \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \beta_p(\hat\pi_{F\mid J}, e_J) \big] \Big] \quad \text{s.t. } \bar\epsilon = \mathbb{E}_{y\sim\hat\pi_F}\big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}[e_J] \big] \\
&= \beta_p(\hat\pi_F, \hat\gamma_F, \bar\epsilon).
\end{aligned}
\]
Collapsing these inequalities gives us
\[
\frac{1}{n_F}\sum_{t\in F} V(r_t, p, y_t) \le \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \bar\epsilon) + \frac{\epsilon_F + K_U\delta}{\bar\epsilon} + \Big( M\big( \epsilon_F + \tilde\epsilon_F \big) + M + K_V\delta \Big).
\]
Summing over forecast contexts $F$ gives us the desired result.

From these two lemmas, it immediately follows that
\[
\mathrm{PR} \le \frac{1}{T}\sum_{F} n_F\, \Delta(\hat\pi_F, \bar\epsilon)
+ \frac{\epsilon + 6\iota + 2K_U\delta + 2K_U\delta}{\bar\epsilon}
+ \Big( M\big( 2\epsilon + 2\tilde\epsilon + 2\iota + 2K_U\delta \big) + 2M + 3\iota + 2K_V\delta + K_V\delta \Big).
\]
Therefore, to bound the principal's regret, all that remains is to bound $\iota$. If we use the algorithm from appendix ?? to generate $\pi_t$, this follows directly from equation (8), which states
\[
\mathbb{E}_{\sigma^*}[\iota] \le \sqrt{\frac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta,
\]
with cardinalities as in equation (8). Finally, we obtain our bound on the principal's expected regret:
\[
\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{F} n_F\, \Delta(\hat\pi_F, \bar\epsilon)
+ \frac{\epsilon + 6\Big( \sqrt{\tfrac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta \Big) + 2K_U\delta + 2K_U\delta}{\bar\epsilon}
+ M\Big( 2\epsilon + 2\tilde\epsilon + 2\Big( \sqrt{\tfrac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta \Big) + 2K_U\delta \Big) + 2M + 3\Big( \sqrt{\tfrac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta \Big) + 2K_V\delta + K_V\delta.
\]

E.5 Proof of Theorem 6
Assume access to a forecast $\pi_t$ for every period $t$. We will define this later. The mechanism chooses $p_t$ as follows. Let $P$ be the unique policy context that includes the policy $p^\dagger(\pi_t,\bar\epsilon) \in P$. Let $p_t = p_P$, where $p_P$ is the representative element of $P$.

The next two lemmas imply an upper bound on the principal's regret in terms of the quantity
\[
\iota \ge \frac{1}{T}\sum_{t=1}^{T} d(\pi_t, \hat\pi_{F_t})
\]
that measures the discrepancy between the forecast $\pi_t$ and the empirical distribution $\hat\pi_F$ conditioned on the forecast context $F$. Lemma 11 is a lower bound on the principal's payoff under $\sigma^*$. Lemma 12 is an upper bound on his payoff under any constant $\sigma_p \in \Sigma$.

Lemma 11.
Suppose the principal runs the mechanism $\sigma^*$. Then
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p_t, y_t) \ge \frac{1}{T}\sum_{F} n_F \inf_\gamma \max_{p} \alpha_p(\hat\pi_F, \gamma, \bar\epsilon)
- \frac{\epsilon + 4\iota + K_U\delta + 2K_U\delta}{\bar\epsilon}
- \big( \iota + K_V\delta + K_V\delta \big).
\]
Proof.
Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under mechanism $\sigma^*$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p_I, y_t) - U(r_I, p_I, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p_I, y) - U(r_I, p_I, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\alpha$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p_I, y_t) \ge \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p_I, y) \big] - K_V\delta \ge \alpha_{p_I}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta.
\]
Summing over information $I \in F$, we obtain
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p_F, y_t)
&\ge \frac{1}{n_F}\sum_{I\in F} n_I\, \alpha_{p_F}(\hat\pi_I, \epsilon_I + K_U\delta) - K_V\delta \\
&\ge \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \alpha_{p^\dagger(\pi_t,\bar\epsilon)}(\hat\pi_I, \epsilon_I + K_U\delta + 2K_U\delta) - K_V\delta - K_V\delta \\
&\ge \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \Big( \alpha_{p^\dagger(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_I + K_U\delta + 2K_U\delta}{\bar\epsilon} \Big) - K_V\delta - K_V\delta \\
&= \frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} \alpha_{p^\dagger(\pi_t,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{\epsilon_F + K_U\delta + 2K_U\delta}{\bar\epsilon} - K_V\delta - K_V\delta.
\end{aligned}
\]
Focus on the first term, i.e.
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F} n_I\, \alpha_{p^\dagger(\pi_F,\bar\epsilon)}(\hat\pi_I, \bar\epsilon)
&= \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \alpha_{p^\dagger(\pi_F,\bar\epsilon)}(\hat\pi_{F\mid J}, \bar\epsilon) \big] \Big] \\
&\ge \min_{e_J}\, \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \alpha_{p^\dagger(\pi_F,\bar\epsilon)}(\hat\pi_{F\mid J}, e_J) \big] \Big] \quad \text{s.t. } \bar\epsilon = \mathbb{E}_{y\sim\hat\pi_F}\big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}[e_J] \big] \\
&= \alpha_{p^\dagger(\pi_F,\bar\epsilon)}(\hat\pi_F, \hat\gamma_F, \bar\epsilon) \\
&\ge \inf_\gamma \alpha_{p^\dagger(\pi_F,\bar\epsilon)}(\hat\pi_F, \gamma, \bar\epsilon) \\
&\ge \inf_\gamma \alpha_{p^\dagger(\hat\pi_F,\bar\epsilon)}(\hat\pi_F, \gamma, \bar\epsilon) - \frac{4\,d(\pi_F,\hat\pi_F)}{\bar\epsilon} - 2\,d(\pi_F,\hat\pi_F).
\end{aligned}
\]
Collapsing these inequalities gives us
\[
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p_F, y_t) \ge \inf_\gamma \alpha_{p^\dagger(\hat\pi_F,\bar\epsilon)}(\hat\pi_F, \gamma, \bar\epsilon)
- \frac{\epsilon_F + 4\,d(\pi_F,\hat\pi_F) + K_U\delta + 2K_U\delta}{\bar\epsilon}
- \big( d(\pi_F,\hat\pi_F) + K_V\delta + K_V\delta \big).
\]
Summing over forecast contexts $F$ gives us the desired result.

Lemma 12.
Suppose the principal runs some constant mechanism $\sigma_p \in \Sigma$. Then
\[
\frac{1}{T}\sum_{t=1}^{T} V(r_t, p, y_t) \le \frac{1}{T}\sum_{F} n_F \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \hat\gamma_F, \bar\epsilon) + \frac{\epsilon + K_U\delta}{\bar\epsilon} + K_V\delta.
\]
Proof. Let $r_I$ be a representative element in the response context $R$ associated with information $I$ under mechanism $\sigma_p$. By regularity,
\[
\epsilon_I \ge \max_{r} \frac{1}{n_I}\sum_{t\in I}\big( U(r, p, y_t) - U(r_I, p, y_t) \big) - K_U\delta = \max_{r}\, \mathbb{E}_{y\sim\hat\pi_I}\big[ U(r, p, y) - U(r_I, p, y) \big] - K_U\delta.
\]
It follows, by regularity and the definition of $\beta$, that
\[
\frac{1}{n_I}\sum_{t\in I} V(r_t, p, y_t) \le \mathbb{E}_{y\sim\hat\pi_I}\big[ V(r_I, p, y) \big] + K_V\delta \le \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta.
\]
Summing over information $I \in F$, we obtain
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p, y_t)
&\le \frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \epsilon_I + K_U\delta) + K_V\delta \\
&\le \frac{1}{n_F}\sum_{I\in F} n_I \Big( \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_I + K_U\delta}{\bar\epsilon} \Big) + K_V\delta \\
&= \frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \bar\epsilon) + \frac{\epsilon_F + K_U\delta}{\bar\epsilon} + K_V\delta.
\end{aligned}
\]
Focus on the first term, i.e.
\[
\begin{aligned}
\frac{1}{n_F}\sum_{I\in F} n_I\, \beta_p(\hat\pi_I, \bar\epsilon)
&= \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \beta_p(\hat\pi_{F\mid J}, \bar\epsilon) \big] \Big] \\
&\le \max_{e_J}\, \mathbb{E}_{y\sim\hat\pi_F}\Big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}\big[ \beta_p(\hat\pi_{F\mid J}, e_J) \big] \Big] \quad \text{s.t. } \bar\epsilon = \mathbb{E}_{y\sim\hat\pi_F}\big[ \mathbb{E}_{J\sim\hat\gamma_F(\cdot,y)}[e_J] \big] \\
&= \beta_p(\hat\pi_F, \hat\gamma_F, \bar\epsilon) \\
&\le \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \hat\gamma_F, \bar\epsilon).
\end{aligned}
\]
Collapsing these inequalities gives us
\[
\frac{1}{n_F}\sum_{I\in F}\sum_{t\in I} V(r_t, p, y_t) \le \max_{\tilde p} \beta_{\tilde p}(\hat\pi_F, \hat\gamma_F, \bar\epsilon) + \frac{\epsilon_F + K_U\delta}{\bar\epsilon} + K_V\delta.
\]
Summing over forecast contexts $F$ gives us the desired result.

From these two lemmas, it immediately follows that
\[
\mathrm{PR} \le \frac{1}{T}\sum_{F} n_F\, \nabla(\hat\pi_F, \bar\epsilon) + \frac{2\big( \epsilon + 2\iota + K_U\delta + K_U\delta \big)}{\bar\epsilon} + \big( \iota + 2K_V\delta + K_V\delta \big).
\]
Therefore, to bound the principal's regret, all that remains is to bound $\iota$. If we use the algorithm from appendix ?? to generate $\pi_t$, this follows directly from equation (8), which states
\[
\mathbb{E}_{\sigma^*}[\iota] \le \sqrt{\frac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta,
\]
with cardinalities as in equation (8). Finally, we obtain our bound on the principal's expected regret:
\[
\mathbb{E}_{\sigma^*}[\mathrm{PR}] \le \frac{1}{T}\sum_{F} n_F\, \nabla(\hat\pi_F, \bar\epsilon)
+ \frac{2\Big( \epsilon + 2\Big( \sqrt{\tfrac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta \Big) + K_U\delta + K_U\delta \Big)}{\bar\epsilon}
+ \Big( \sqrt{\tfrac{|\cdot|\,|\cdot|}{T}} + 2\,|\cdot|\,\delta + 2K_V\delta + K_V\delta \Big).
\]

E.6 Proof of Theorem 1
We adapt the proof of theorem 4 to prove theorem 1. This will require only relatively minor changes. Let $\hat\pi_{I,F}$ denote the empirical distribution among periods $t \in I \cap F$. Let $n_{I,F}$ indicate the number of such periods. Let $\pi_F$ denote the (unique) forecast associated with forecast context $F$. Previously, we defined
\[
\iota \ge \frac{1}{T}\sum_{t=1}^{T} d(\pi_t, \hat\pi_{I_t}).
\]
Now, we define
\[
\iota \ge \frac{1}{T}\sum_{t=1}^{T} d(\pi_t, \hat\pi_{I_t,F_t}).
\]
Begin at the last line of lemma 7, where it says "focus on the first term". Rewrite that first term as
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_I, \bar\epsilon).
\]
Now we switch $\hat\pi_I$ with $\pi_I$, i.e. the convex combination of forecasts,
\[
\pi_I = \frac{1}{n_I}\sum_{F} n_{I,F}\, \pi_F.
\]
By lemma 4, this gives us
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \alpha_{p^*(\pi_F,\bar\epsilon)}(\hat\pi_I, \bar\epsilon)
\ge \frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \Big( \alpha_{p^*(\pi_F,\bar\epsilon)}(\pi_I, \bar\epsilon) - \frac{2\,d(\pi_I,\hat\pi_I)}{\bar\epsilon} - d(\pi_I,\hat\pi_I) \Big).
\]
Note that every forecast $\pi_F$ leads to a policy $p^*(\pi_F,\bar\epsilon)$ that is in the policy context $P$ associated with information $I$. By assumption ??, the right-hand side is at least
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \Big( \alpha_{p^*(\pi_I,\bar\epsilon)}(\pi_I, \bar\epsilon) - \frac{2\,d(\pi_I,\hat\pi_I)}{\bar\epsilon} - d(\pi_I,\hat\pi_I) - O(\delta) \Big).
\]
Now we apply lemma 4 again, obtaining at least
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \Big( \alpha_{p^*(\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{4\,d(\pi_I,\hat\pi_I)}{\bar\epsilon} - 2\,d(\pi_I,\hat\pi_I) - O(\delta) \Big),
\]
and then lemma 5, obtaining at least
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \Big( \alpha_{p^*(\hat\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{8\,d(\pi_I,\hat\pi_I)}{\bar\epsilon} - 4\,d(\pi_I,\hat\pi_I) - O(\delta) \Big).
\]
By the homogeneity and subadditivity of the $\ell_1$ norm,
\[
d(\pi_I, \hat\pi_I) \le \frac{1}{n_I}\sum_{F} n_{I,F}\, d(\pi_{I,F}, \hat\pi_{I,F}),
\]
which gives us a lower bound of
\[
\frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \Big( \alpha_{p^*(\hat\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{8\,d(\pi_{I,F},\hat\pi_{I,F})}{\bar\epsilon} - 4\,d(\pi_{I,F},\hat\pi_{I,F}) - O(\delta) \Big)
\ge \frac{1}{T}\sum_{I}\sum_{F}\sum_{t\in I\cap F} \alpha_{p^*(\hat\pi_I,\bar\epsilon)}(\hat\pi_I, \bar\epsilon) - \frac{8\iota}{\bar\epsilon} - 4\iota - O(\delta).
\]
This is essentially where we were by the end of lemma 7, with the addition of an $O(\delta)$ term and slightly different constants.

Lemma 8 requires no change. The discussion following lemma 8 requires very little change. Find the line that begins with "Consider any two periods". We rewrite it as follows. Consider any two periods $t, \tau$ where $I_t = I_\tau$ and $F_t = F_\tau$ but $C_t \ne C_\tau$. Since $F_t = F_\tau$, we know that $\pi_t = \pi_\tau$. Now, consider
\[
n_{F_t,C_t}\, d(\pi_t, \hat\pi_{C_t}) + n_{F_\tau,C_\tau}\, d(\pi_\tau, \hat\pi_{C_\tau})
= n_{F_t,C_t}\, d(\pi_t, \hat\pi_{C_t}) + n_{F_t,C_\tau}\, d(\pi_t, \hat\pi_{C_\tau})
\ge \big( n_{F_t,C_t} + n_{F_t,C_\tau} \big)\, d\!\left( \pi_t,\; \frac{n_{F_t,C_t}\,\hat\pi_{C_t} + n_{F_t,C_\tau}\,\hat\pi_{C_\tau}}{n_{F_t,C_t} + n_{F_t,C_\tau}} \right)
\]
by subadditivity and homogeneity of norms. By continuing this process of combining contexts, we find
\[
\frac{1}{T}\sum_{C}\sum_{F} n_{F,C}\, d(\pi_t, \hat\pi_C) \ge \frac{1}{T}\sum_{I}\sum_{F} n_{I,F}\, d(\pi_t, \hat\pi_{I,F}) = \iota.
\]
Therefore, our bound holds except with the addition of an $O(\delta)$ term and slightly different constants.
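For readability, the convex-combination step invoked just above (and again in the proof of theorem 2 below) can be spelled out. Assuming, as the subadditivity and homogeneity language indicates, that $d(\cdot,\cdot)$ is the distance induced by a norm $\|\cdot\|$, and noting that $\pi_I = \frac{1}{n_I}\sum_F n_{I,F}\,\pi_{I,F}$ (the forecast is constant on each $I \cap F$) while $\hat\pi_I = \frac{1}{n_I}\sum_F n_{I,F}\,\hat\pi_{I,F}$, we have
\[
d(\pi_I, \hat\pi_I)
= \Big\| \tfrac{1}{n_I}\textstyle\sum_F n_{I,F}\,\big( \pi_{I,F} - \hat\pi_{I,F} \big) \Big\|
\le \tfrac{1}{n_I}\textstyle\sum_F n_{I,F}\, \big\| \pi_{I,F} - \hat\pi_{I,F} \big\|
= \tfrac{1}{n_I}\textstyle\sum_F n_{I,F}\, d(\pi_{I,F}, \hat\pi_{I,F}).
\]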
E.7 Proof of Theorem 2

Define
\[
\pi_P = \frac{1}{n_P}\sum_{F} n_{P,F}\, \pi_F.
\]
It is straightforward to adapt the proof of theorem 5. Replace all references to $p^*(\pi_F,\bar\epsilon)$ with $p^*(\pi_P,\bar\epsilon)$. This changes $U$ and $V$ (and all derived terms, like $\alpha$) by at most $O(\delta)$, by assumption ??. Replace all remaining references to forecast contexts $F$ with policy contexts $P$. It remains to verify that
\[
\iota \ge \frac{1}{T}\sum_{F}\sum_{t\in F} d(\pi_F, \hat\pi_F) \ge \frac{1}{T}\sum_{P}\sum_{t\in P} d(\pi_P, \hat\pi_P),
\]
which follows from the homogeneity and subadditivity of the $\ell_1$ norm, and the fact that $\pi_P, \hat\pi_P$ are convex combinations of $\pi_F, \hat\pi_F$ for $F \subseteq P$.

E.8 Proof of Theorem 3