Machine Learning for Strategic Inference
IN-KOO CHO AND JONATHAN LIBGOBER
Abstract.
We study interactions between strategic players and markets whose behavior is guided by an algorithm. Algorithms use data from prior interactions and a limited set of decision rules to prescribe actions. While as-if rational play need not emerge if the algorithm is constrained, it is possible to guide behavior across a rich set of possible environments using limited details. Provided a condition known as weak learnability holds, Adaptive Boosting algorithms can be specified to induce behavior that is (approximately) as-if rational. Our analysis provides a statistical perspective on the study of endogenous model misspecification.

Date: January 26, 2021. We thank Juan Carrillo, Jason Hartline, Navin Kartik, Roger Moon, Xiaosheng Mu, Guofu Tan, Ashesh Rambachan, Joel Sobel, Erik Strand, and especially Grigory Franguridi, for helpful conversations and comments, and seminar audiences at AMETS, Emory, the Econometric Society World Congress, the NSF/NBER/CEME Conference on Mathematical Economics, Rochester, UC-San Diego, and USC. This project started when the first author was visiting the University of Southern California. We are grateful for hospitality and support from USC. Financial support from the National Science Foundation is gratefully acknowledged.

1. Introduction

The importance of algorithms in guiding economic behavior is already significant, and likely to only be more so in the years to come. But since a number of economic phenomena rely crucially on the presence of rational individuals on both sides of a particular interaction, an open question is whether traditional rational models apply to such situations. Of course, economists recognize that people often fail to act rationally, with certain consistent failures having empirical implications. But the extent to which algorithms are susceptible to errors is a separate issue, and one that should be addressed for economists to speak to the increasing number of applications where algorithm design plays a central role.

This paper introduces a framework to study the question of whether and when algorithms can approximate rational behavior. In our model, a rational, strategic player (who we refer to as a sender) chooses a strategy when interacting with an algorithm that prescribes actions to a stream of short-lived actors (who we refer to as receivers). [Footnote: The terminology of sender and receiver is meant to highlight the role of our model's timing; many of our applications lie beyond settings where these labels are traditionally applied.] A distinguishing feature of our exercise is our focus on the problem of strategic inference: specifically, we assume that the sender commits to a strategy which maps states into actions, and so a rational receiver would update beliefs about the best reply after observing the sender's action. A rational receiver would thus make an inference regarding a payoff-relevant state using knowledge of the sender's strategy. On the other hand, the algorithm has access to observations on what transpired in previous interactions. We are interested in comparing the rational receiver's strategy with the strategy induced by the algorithm, with a focus on determining when rationality can be replicated. We are particularly interested in the case where the algorithm seeks to provide these recommendations without using details of the sender's objective or the particular setting at hand (for instance, as a sales platform might when designing an algorithm to be used for a variety of different products). In other words, we are interested in finding an algorithm capable of inducing rationality under as wide a set of environments as possible.

In our model, the algorithm produces recommendations using data from relevant interactions in the past (where data consists of sender actions and ex-post payoffs). These recommendations are determined by finding a best fitting decision rule. By best fitting, we mean that there is minimal error, with errors being weighted according to some specific objective. The main assumption here is that the algorithm can determine the best fitting rule from a particular set of decision rules, which we refer to as a hypothesis class (following the machine learning literature). A crucial limitation is that this hypothesis class is restricted and must be specified in advance, so that not every feasible mapping from messages into actions can be fit to the data. Thus, there is no a priori guarantee that finding the best fitting rule within the given set yields the rational reply; whether this property holds will depend upon the sender's strategy.

Our theoretical question can be phrased as follows: Do these limitations of algorithms inhibit the ability to prescribe actions which are (approximately) rational? We show that, while constraints may be exploited by strategic actors, an algorithm designer with particular capabilities can induce the as-if rational outcome in equilibrium. The answer to this question thus depends on what we assume the algorithm is capable of. Our contribution is to identify what some of these capabilities are.

Constraints on algorithms of the kind in our model are often studied in the machine learning literature, which typically treats the data generating process as exogenous. Our goal, however, is to perform a similar algorithm design exercise, but in a strategic setting. To make sense of the restrictions on classifiers that can be fit to data, it may be instructive to note that in typical machine learning problems, a simple prediction (for instance, a "yes-no" recommendation) is sought for an observation among a very large set of possibilities. Seeking to find the correct recommendation for each one may be intractable or undesirable (given data limitations), and so a simpler set may be used as a baseline.
On the other hand, it may still be possible to construct a new decision rule if the algorithm specifies how this should be done in advance. In our model, this takes the form of assuming the algorithm is limited in what can be fit to the data, but is otherwise flexible, in a way we will make precise below.

One interpretation of this limitation is that the algorithm suffers from a form of model misspecification: the true optimal decision rule for a receiver may fall outside of the class of decision rules that can be prescribed by the algorithm. There are two notable differences from a standard model misspecification exercise, however. The first difference is that the algorithm in our framework is concerned explicitly with prescribing behavior, and not with the problem of inference per se. In the (currently very active) literature on model misspecification (see, for instance, Esponda and Pouzo (2014)), a decisionmaker is assumed to be potentially incorrect regarding the set of possible parameters, but otherwise uses an optimally chosen decision rule. We, on the other hand, are not (directly) interested in
learning the underlying parameters, but rather in making an optimal prediction. The second difference is that the extent to which the optimal prediction falls outside of the realm of considered models is endogenous in our setting. Since we allow algorithms to specify decision rules arbitrarily—constraining instead how models can be fit to the data—they are, in principle, able to expand the potential decision rules the receiver could use if it is specified how this should be done. As a result, the extent to which the algorithm is misspecified is endogenous to the constraints of the algorithm design problem.

What should one expect to happen given these limitations of an algorithm? On the one hand, in order for the algorithm to be able to give non-degenerate predictions without using detailed knowledge of the particular parameters of the receiver's problem, a sufficiently rich set of classifiers should be used. We focus on cases where this criterion suggests using at least the set of single-threshold classifiers, which condition a recommendation only on which side of a threshold the observable messages lie. However, since our setting requires strategic inference on the part of receivers, this class of hypotheses is susceptible to manipulation by a rational sender. For our purposes, Rubinstein (1993) identified the key economic force, studying a buyer-seller game where the buyer is restricted in the set of decision rules that can be utilized. Specifically, this paper showed that if a rational decisionmaker is restricted to use a single threshold classifier—i.e., one that makes the same decision on a given side of a fixed threshold—then the seller can price discriminate via a particular form of randomization which "fools" these buyers into making a decision which is suboptimal given the realized price. [Footnote: The reasoning behind this result is as follows. First, the optimally chosen classifier can do strictly better than simply randomizing the guess, implying that the seller can exploit the incentives of the buyer in order to manipulate the decision rule. On the other hand, it is impossible for threshold rules to implement the optimal decision with probability 1 when this rational rule is non-monotone in the price. The first point implies the buyer trades off against errors, and the second point implies that the tradeoff falls short of the fully rational response. As a result, the seller can force a different decision than would be rationally optimal for these buyers (with arbitrarily high probability).] Our framework nests Rubinstein (1993) as a special case, but considers more general environments as well.

Our analysis elucidates a tension between the ability to fit rich and coarse sets of models. As Rubinstein (1993) shows, if a decisionmaker is limited in the decision rules that can be utilized, then there is a potential for exploitation. In order to combat this temptation, one may seek to add more possible replies to be fit to the data; in other words, to make the hypothesis class richer. Indeed, a decisionmaker could prevent the particular instance of exploitation he highlights by doing so. However, fitting richer decision rules may have other undesirable consequences, and may still fail to prevent a slightly more elaborate strategy from succeeding at exploitation. Above, we mentioned that this view is common in the machine learning literature; finding the best fitting model within a set of models may be computationally demanding if this set is very large. A goal of our paper is to highlight this tradeoff between fitting coarse models—which have attractive statistical properties, but poor behavioral properties—and rich models, for which the situation is reversed.

Our proposed solution is to use the Adaptive Boosting algorithm (Schapire and Freund (2012)), which specifies exactly how to construct a decision rule as a weighted combination of classifiers, with the weights specified by the algorithm. The algorithm requires
(repeatedly) fitting a classifier to some distribution over prices and outcomes, from some set of baseline classifiers.

Returning to the particular question at hand, the requirement on the set of classifiers able to be fit is called weak learnability, and it is significantly less demanding than requiring all possible rational replies to be specified. We seek to highlight that this requirement is necessary and sufficient to overcome the problem of model misspecification mentioned above, i.e., the gap between the set of decision rules that can be fit to the data and those that a rational receiver can utilize. We provide results which show how to check it in several straightforward applications, particularly when resorting to single-threshold classifiers (which typically have natural interpretations).

To summarize, the answer to our theoretical question is that rationality can be ensured with the ability to (a) find a best-fitting decision rule from a class which satisfies weak learnability, and (b) combine such classifiers in a particular way (specified in advance). It is worth emphasizing one technical difference—due to our focus on a strategic inference problem—between our exercise and similar ones considered in computer science or machine learning, where these issues have received more attention. In principle, the rational decision in our model is not observed if the sender uses a strategy that does not reveal the state given an observation. In a lemons problem, for instance, it may be that "low quality" is observed at some price, but that "high quality" is in fact more likely, so that correspondingly a rational buyer (receiver) would choose a "buying" action. Therefore, in our problem, the payoff-maximizing decision must be inferred and constructed by the algorithm. One of the main results of this paper is that this added difficulty does not change the qualitative desirable properties of the algorithm, which we show using results from large deviations theory—though this does induce some modifications on precisely how good of an approximation the algorithm is able to guarantee.

Returning to the discussion of Rubinstein (1993), we see that the issue with single threshold classifiers is that they are not strong learners (i.e., they cannot ensure the optimal decision is taken with probability 1 following any price), even though they are weak learners (i.e., they can outperform random guesses when chosen optimally). The remarkable property of the Adaptive Boosting algorithm is that weak learnability is sufficient to construct a classifier that yields a similar guarantee as under a model class satisfying strong learnability. It is interesting that part of the intuition for the main result in Rubinstein (1993)—which relies upon the buyer being able to strictly improve payoffs beyond a trivial default to induce a particular decision rule—exactly tells us how to overcome the main conclusion, once we have the algorithm in hand.

At first glance, it appears that there is a significant gap between decision rules satisfying weak learnability and those which induce rational replies. Rationality requires, in principle, very rich decision rules to be used, and for their performance to leave very little room for error. Weak learnability does not, and only requires a uniform improvement over a random guess. It is therefore perhaps surprising that in our exercise, there turns out to be no gap at all.
Due to weak learnability, the apparent gap in rationality caused by the limitation in the decision rules that can be fit to data can be overcome by a clever choice of algorithm. The result is that the algorithm can induce rational behavior without knowing anything beyond the observed data from past interactions. In contrast, strong
learnability (i.e., prescribing the optimal action with high probability) will usually require precise knowledge of the sender's strategy.

We briefly mention that the algorithm design problem we study accommodates a rich possible action space, even with the same restrictions on the decision rules that can be fit to data. In particular, a version of the weak learnability condition in settings with two possible receiver actions also applies to settings with an arbitrary finite number of actions. This is in sharp contrast to many other papers in the large literature on "decisionmakers as statisticians" (reviewed below), which use similar motivation to study departures from rationality. These papers have typically focused on the binary action case. This limitation is very natural—many of the key results from machine learning which arise when there are two possible predictions do not extend easily (or even at all) to the case of multiple actions. However, we can handle this in our problem, suggesting our algorithm is of broader interest. We believe that this extension is important, as it shows our conclusions do not hinge on other artificial limitations on the environment.

Our exercise provides formalism within which machine learning methods can be applied to answer new questions relevant to microeconomic theorists, and vice versa. Our model is deliberately abstract, in order to provide general principles guiding when the problem of model misspecification can be overcome. One key message is that while it is not possible to guarantee that rationality emerges for an arbitrary data generating process, it is possible if the data generating process is endogenous (due to the strategic player) to the statistical algorithm. This argument requires some additional steps using the incentives of the actors to demonstrate that the resulting output does in fact correspond to what is traditionally thought of as subgame perfection. This endogeneity issue makes the problem no longer a pure statistical exercise. The modifications our analysis requires extend beyond the initial need to show that it is possible to do better than random guessing in this environment. As our analysis elucidates, AdaBoost is capable of handling a particular kind of unboundedness in the cardinality of the action space. It is thus necessary to discipline the environment further in order to achieve our results.

2. Literature
This paper takes the framework of PAC learnability, familiar from machine learning, and applies it to a strategic setting. Within economics, this agenda is most closely related to the literature on learning in games when behavior depends on a statistical method. The single-agent problem is a particular special case, and this case is the focus of Al-Najjar (2009) and Al-Najjar and Pai (2014). However, since we are focused on a strategic setting, the data the algorithm receives is endogenous in our setting. In contrast, their benchmarks correspond to the case of exogenous data. This problem is also studied in Spiegler (2016), who focuses on causality and defines a solution concept for behavior that arises from individuals fitting a directed acyclic graph to past observations. More recently, Zhao, Ke, Wang, and Hsieh (2020) take a decision-theoretic approach in a single-agent setting with lotteries, showing how a relaxation of the independence axiom leads to a neural-network representation of preferences.

Taking these approaches to games, the literature has still for the most part focused on settings where the interactions between players are static, ruling out the main environments
we are interested in here. In contrast, our setting is a simple, two-player (and two-move) sequential game. [Footnote: By itself the distinction may not immediately seem significant—after all, a Nash equilibrium in an extensive form game involves choosing a strategy to best respond to the opponent, and is usually stated as a single (and thus static) choice. However, the additional restriction to binary action or 0-1 prediction problems makes nesting our problem less straightforward.] We also note that much (though not all) of this literature focuses on binary prediction problems, whereas we discuss how to specify algorithms in the general finite action case as well. Cherry and Salant (2019) discuss a procedure whereby players' behavior arises from a statistical rule estimated by sampling past actions. This leads to an endogeneity issue similar to the one present in our environment, i.e., an interaction between the data generating process and the statistical method used to evaluate it. Eliaz and Spiegler (Forthcoming) study the problem of a statistician estimating a model in order to help an agent take an action, motivated as we are by issues involved with the interaction between rational players and statistical algorithms. Liang (2018) also focuses on games of incomplete information, asking when a class of learning rules leads to rationalizable behavior. Studying model selection in econometrics, Olea, Ortoleva, Pai, and Prat (2019) consider an auction model and ask which statistical models achieve the highest confidence in results as a function of a particular dataset.

On the other hand, the literature on learning in extensive form games has typically assumed that agents experiment optimally, and hence embeds a notion of rationality on the part of agents which we dispense with in this paper. Classic contributions include Fudenberg and Kreps (1995), Fudenberg and Levine (1993), and Fudenberg and Levine (2006). Most of this literature has focused on cases where there is no exogenous uncertainty regarding a player's type, asking whether self-confirming behavior emerges as the outcome. An important exception is Fudenberg and He (2018), who study the steady-state outcomes from experimentation in a signalling game. While a rational agent in our game would need to form an expectation over an exogenous random variable, signalling issues do not arise because our sender has commitment.

Perhaps closest in motivation is the computer science literature studying how well algorithms perform in strategic situations, as well as how rational actors may respond when facing them. [Footnote: On the question of algorithms in particular, one concern is that the algorithm design problem may be susceptible to bias or induce unwanted discrimination when implemented, relative to rationality. See Rambachan, Kleinberg, Mullainathan, and Ludwig (2020) for an analysis of these issues and how they may be overcome.] Braverman, Mao, Schneider, and Weinberg (2018) consider optimal pricing by a seller repeatedly selling to a single buyer who repeatedly uses a no-regret learning algorithm. They show that, on the one hand, while a particular class of learning algorithms (i.e., those that are mean-based) are susceptible to exploitation, others would lead to the seller's optimal strategy simply being to use the Myersonian optimum. Deng, Schneider, and Sivan (2019) also study strategies against no-regret learners in a broad class of games without uncertainty, and consider whether a strategic player can guarantee a higher payoff than what would be implied by first-mover advantage. Blum, Hajiaghayi, Ligett, and Roth (2008) consider the Price of Anarchy (i.e., the ratio between first-best welfare and worst-case equilibrium welfare), and show in a broad class of games that this quantity is the same whether players use Nash strategies or regret-minimizing ones. Nekipelov,
Syrgkanis, and Tardos (2015) assume players in a repeated auction use a no-regret learning algorithm, making similar behavioral assumptions as we do here. Their interest is in inferring the set of rationalizable actions from data.

While our motivation is very similar—and indeed, we seek to incorporate several aspects of this literature's conceptual framework—there are three notable differences. First, this literature typically assumes particular algorithms or principal objectives (such as no-regret learning) which differ from traditional Bayesian rationality. In contrast, we maintain a Bayesian rational objective for the seller, and also focus on an algorithm designer seeking to maximize the expected payoffs of agents. Second, we focus on relating the incentives of the rational player and the algorithm's capabilities, and study the extent to which different assumptions on the algorithm design problem influence the task of approximating rationality. Our main result articulates how different action spaces for the algorithm designer yield different results regarding whether and when the outcome will approximate the rational benchmark. Lastly, our general framework focuses on settings with strategic inference—that is, where the payoffs following a given principal action are state-dependent—and thus covers a set of single-agent applications which extend beyond particular pricing settings, where most (though admittedly not all) of this literature has focused. In particular, the settings discussed in this literature do not cover Lemons markets settings (which Rubinstein (1993) falls under, for example) or Persuasion, which form our primary starting point. [Footnote: An important exception is Camara, Hartline, and Johnsen (2020), who study an environment covering many of our same applications, such as Bayesian Persuasion. However, they still maintain the other two distinguishing features, focusing on a regret objective for the principal, as well as particular no-regret assumptions for the agent. Still, we emphasize that both our paper and theirs focus on environments where the principal/sender chooses a state-dependent strategy. This leads to the aforementioned endogeneity between the data generating process (induced by the principal) and the choices of the algorithm/learner—this emerges due to the fact that the same sender action may induce two distinct replies from the algorithm following two distinct sender strategies. In their setting, this endogeneity motivates the use of "policy-regret" as an objective for the principal (due to their reinforcement learning approach to the principal's problem). While we do not use a regret objective for the principal, see Arora, Dekel, and Tewari (2012) and Arora, Dinitz, Marinov, and Mohri (2018) for more on the differences between these notions.] As a result, these new technical issues (e.g., dealing with residual uncertainty in the correct actions) are not addressed in these papers to our knowledge. Despite these differences, our hope is that this paper inspires further connection between the economics literature on decisionmakers as statisticians and the computer science literature on strategic choices against classes of algorithms. It appears to us that these results from computer science have not yet been fully appreciated in economics.

3. (Sender-Receiver) Stage Games

We first describe the stage game interaction in which the algorithm designer seeks to prescribe actions on behalf of myopic actors (who may be, for instance, receivers, buyers, or agents, depending on the particular setting of interest). The stage game features a strategic actor as well. That said, our exposition in this section addresses neither how the strategic actor chooses her strategy, nor how the algorithm is determined. This is done in Section 4, which describes the interaction which yields these objects and the relevant objective for each actor.
3.1. Actions and Parameters.
The stage game is a sender-receiver game in which an informed sender makes the first move. We often call the sender the (informed) principal, and the receiver the agent, as our lead example is built on the informed principal problem of Maskin and Tirole (1992). However, our model also describes a sender-receiver game with sender commitment, as in Kamenica and Gentzkow (2011).

Let Θ be the set of types, endowed with a prior distribution π which is common knowledge among players. This type is payoff relevant to both the sender and the receiver. Define π(θ) as the probability that type θ ∈ Θ is realized. Throughout the paper, we only consider π with finite support. Conditioned on the realized value of θ ∈ Θ, the sender takes an action p ∈ P ⊂ R^n, where P is a compact subset of R^n. Our analysis in most of the paper will assume further that |P| < ∞, although we discuss how to modify this assumption in Section 5.3.3 (to allow for, for instance, continuous distributions). A strategy of the sender is

σ : Θ → ∆(P),

where ∆(X) denotes the set of probability distributions over a set X. We let Σ denote the set of feasible strategies for the sender, and importantly, assume that this strategy is determined (and committed to) in the algorithm game described in Section 4. Conditioned on p (but not θ), the agent chooses a ∈ A according to

r : P → ∆(A).

We assume |A| < ∞, and in our analysis we treat the cases of |A| = 2 and |A| > 2 separately. The payoffs of the sender and the receiver following (θ, p, a) are u(θ, p, a) and v(θ, p, a), respectively.

The timing of the moves in the stage game is as follows:

S1. An exogenous state θ ∈ Θ is realized according to π, with only the sender observing the realized state θ.
S2. The sender's action p ∈ P is realized according to σ(p : θ).
S3. The receiver takes action a ∈ A conditioned on p.
S4. Payoffs are realized according to u(θ, p, a) and v(θ, p, a).

For instance, if we interpret p = (p_1, . . . , p_n) as a contract, and a ∈ A = {−1, 1} as "reject" (a = −1) or "accept" (a = 1), the stage game is a model of the informed principal (Maskin and Tirole (1992)). If p is interpreted as a message sent by a worker, and a ∈ A as the wage paid by the firm, then the stage game becomes a signaling game (Spence (1973)). For now, we place no further restrictions on u(θ, p, a) and v(θ, p, a), though these are often implicit in the economic problem of interest.
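To fix ideas, the timing S1–S4 can be summarized in a short simulation sketch. The following Python fragment is purely illustrative: the function name and the dictionary representations of π and σ are our own choices, not part of the formal model.

import random

# One round of the stage game, following the timing S1-S4.
# prior: dict mapping theta -> pi(theta)
# sigma: dict mapping theta -> (dict mapping p -> sigma(p : theta))
# receiver_rule: a map from p to an action a; u, v: payoff functions.
def play_stage_game(prior, sigma, receiver_rule, u, v):
    thetas = list(prior)
    theta = random.choices(thetas, weights=[prior[t] for t in thetas])[0]  # S1
    ps = list(sigma[theta])
    p = random.choices(ps, weights=[sigma[theta][q] for q in ps])[0]       # S2
    a = receiver_rule(p)                                                   # S3: a depends on p only
    return theta, p, a, u(theta, p, a), v(theta, p, a)                     # S4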
3.2. Payoffs and the Rational Benchmark.

Describing the outcomes of the above interactions when the receiver is rational is a familiar exercise. In this case, his optimization problem is

max_{a ∈ A} Σ_{θ ∈ supp π} v(θ, p, a) π(θ : p),

where π(θ : p) is the posterior probability assigned to θ conditioned on p. If p is used with positive probability by σ, then π(θ : p) is computed by Bayes rule:

π(θ : p) = σ(p : θ) π(θ) / Σ_{θ′} σ(p : θ′) π(θ′).
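When Θ, P, and A are finite, this Bayes computation and the induced optimal reply can be written out directly. The sketch below uses hypothetical helper names and the same dictionary representation as the earlier fragment; it anticipates the rational label defined next.

def posterior(prior, sigma, p):
    # pi(theta : p) is proportional to sigma(p : theta) * pi(theta) (Bayes rule)
    w = {th: sigma[th].get(p, 0.0) * prior[th] for th in prior}
    total = sum(w.values())
    return {th: x / total for th, x in w.items()} if total > 0 else None

def rational_label(prior, sigma, p, v, actions):
    # the action maximizing expected ex-post payoff under the posterior
    post = posterior(prior, sigma, p)
    return max(actions, key=lambda a: sum(post[th] * v(th, p, a) for th in post))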
We define the rational label to denote the receiver's strategy were he rational. More precisely, the optimal response is a function of the chosen σ ∈ Σ and the realized p ∈ P:

y^R : Σ × P → A

is a solution to the following optimization problem:

Σ_{θ ∈ supp π} v(θ, p, y^R(σ, p)) π(θ : p) ≥ Σ_{θ ∈ supp π} v(θ, p, a) π(θ : p)  ∀a ∈ A,

where π(θ : p) is computed via Bayes rule whenever Σ_θ σ(p : θ) π(θ) > 0.

Define σ^R as a best response of the sender against a Bayesian rational receiver with perfect foresight:

Σ_{θ, p} u(θ, p, y^R(σ^R, p)) σ^R(p : θ) π(θ) ≥ Σ_{θ, p} u(θ, p, y^R(σ, p)) σ(p : θ) π(θ)  ∀σ ∈ Σ.

By construction, (σ^R, y^R) constitutes a perfect Bayesian equilibrium in the stage game with a rational receiver.

3.3. Examples of Stage Games.
Before proceeding to the description of the algorithm game, we describe a few of the stage game interactions that are of primary interest. We will return to these later in order to illustrate the incentives for each player.
3.3.1. Insurance.
The following is borrowed from Maskin and Tirole (1992). Suppose that the principal (sender) is a shipping company seeking to purchase insurance from an insurance company, an agent (receiver) that is seeking to delegate the decision of whether to offer the terms put forth by the shipping company. The principal seeks insurance every period, but faces risk (e.g., due to the location of shipping demand) that is idiosyncratic every period.

In this case, we imagine the principal chooses terms within some compact set P ⊂ R², where p = (x, q) denotes a policy which provides a payment x in the event of a loss and costs an amount q. If θ ∈ {L, H} (with L < H) denotes the probability of a loss, then the principal's utility is

u(θ, p, a) = (1 − θ) f(I − q) + θ f(I − q − L + x)  if a = 1,
u(θ, p, a) = (1 − θ) f(I) + θ f(I − L)              if a = −1,

for some concave f. The agent's utility is

v(θ, p, a) = q − θx  if a = 1,
v(θ, p, a) = 0       if a = −1.

We consider specifications of P whereby, against a rational buyer, the principal would seek a high level of insurance when risk is high (i.e., θ = H), and avoid insurance when risk is low (i.e., θ = L). In contrast, the agent's payoff may be decreasing in the quantity of insurance when θ = H, while increasing in the quantity of insurance when θ = L. [Footnote: For a fixed σ, y^R(σ, ·) : P → A is a strategy of the agent, satisfying sequential rationality.]
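For concreteness, these payoff functions can be coded directly. The numerical values of the income, the loss size, and the concave index f below are hypothetical placeholders (the model leaves them unspecified); theta here is the numeric loss probability.

import math

INCOME, LOSS = 10.0, 5.0          # hypothetical income and loss size
f = lambda c: math.log(1.0 + c)   # some concave utility index

def u_principal(theta, p, a):
    x, q = p                      # p = (x, q): payment x after a loss, premium q
    if a == 1:
        return (1 - theta) * f(INCOME - q) + theta * f(INCOME - q - LOSS + x)
    return (1 - theta) * f(INCOME) + theta * f(INCOME - LOSS)

def v_agent(theta, p, a):
    x, q = p
    return q - theta * x if a == 1 else 0.0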
3.3.2. Labor Market Signaling.

Our framework is general and can be expanded to cover other settings as well. Let us consider a labor market signaling model. Here, the "receiver" takes the role of the firm and the "sender" takes the role of the worker from the Spence signalling model (as in, for instance, Maskin and Tirole (1992)). The true state is the productivity of the worker θ ∈ Θ = {H, L}, where π(H) = π(L) = 1/2 and H > L. Conditioned on θ, a worker chooses p, which we interpret as education level. Her strategy is σ : Θ → P ⊂ R_+. The payoff function of the sender is

u(θ, p, a) = a − p/(θ + 1).

We abstract away from the competition among multiple firms in the labor market. Conditioned on p, the labor market wage is determined according to the expected productivity E(θ : p) conditioned on p. The firm has to pay the worker an amount equal to the expected productivity because of (un-modeled) competition among firms. The receiver's goal is to make an accurate forecast about the expected productivity of the worker. The payoff of the receiver is

v(θ, p, a) = −(θ − a)².

If the support of σ(p : H) is disjoint from the support of σ(p : L), σ is a separating strategy. If a separating strategy is an equilibrium strategy, then the equilibrium is called a separating equilibrium. We often focus on the Riley outcome, which maximizes the ex ante expected payoff of the principal among all separating equilibria.

3.3.3. Monopoly Market.
In the first two examples, only the sender has private information. We can allow the stage game interaction to feature an additional parameter observed only by the receiver, denoted i. We denote the sender's payoff by u(θ, p, a, i) and the receiver's payoff by v(θ, p, a, i). For example, the sender may interact with some receivers who are algorithmic, and others who are fully rational. The agent knows whether he is algorithmic or fully rational. The principal does not observe the type of an agent, but only knows the probability distribution. Indeed, the setting of Rubinstein (1993) features such a dichotomy, as we discuss.

While Rubinstein (1993) differs expositionally, we review the key ideas and describe how it falls under our framework. Suppose θ ∈ Θ = {L, H}. Let v_θ be the marginal utility of the good, where v_H > v_L > 0. The prior probability distribution is π(H) = π(L) = 1/2. The seller observes θ. The seller chooses a price p ∈ [v_L, v_H] ⊂ P ⊂ R, conditioned on θ ∈ Θ. The action of a buyer is a ∈ A = {−1, 1}: a buyer responds to p ∈ P by purchasing (a = 1) or not purchasing (a = −1) the good at p.

The seller is facing a unit mass of infinitesimal buyers, who can be either type 1 or type 2. The proportion of type 1 buyers is r ∈ (0, 1). If θ = L, the product costs c_L for the seller regardless of the type of the buyer. If θ = H, the product costs c_i to serve a type i buyer (i ∈ {1, 2}). We assume

c_1 > v_H > c_2 > v_L > c_L    (3.1)
r c_1 + (1 − r) c_2 > v_H      (3.2)

so that the agent is exposed to the lemons problem. We focus on π supported on parameters which satisfy (3.1) and (3.2).

A buyer generates utility only if he purchases the good; his payoff function is

v(θ, p, a, i) = 0        if a = −1,
v(θ, p, a, i) = v_θ − p  if a = 1,

regardless of i. The payoff of the seller is

u(θ, p, a, i) = 0        if a = −1,
u(θ, p, a, i) = p − c_L  if a = 1, θ = L,
u(θ, p, a, i) = p − c_i  if a = 1, θ = H.

The unique equilibrium strategy of the seller is

σ^R(θ) = v_L  if θ = L,
σ^R(θ) = v_H  if θ = H.

The buyer's equilibrium strategy is

y^R(σ^R, p) = 1   if p ≤ v_L,
y^R(σ^R, p) = −1  if p > v_L.

Trading occurs only if θ = L, and therefore the equilibrium is inefficient. Note that the construction of y^R requires precise information about v_L.
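The equilibrium logic can be checked mechanically. The parameter values in the sketch below are hypothetical, chosen only to satisfy (3.1) and (3.2).

# hypothetical parameters satisfying (3.1) and (3.2)
v_H, v_L = 6.0, 3.0
c_1, c_2, c_L = 9.0, 4.0, 1.0
r = 0.5
assert c_1 > v_H > c_2 > v_L > c_L          # (3.1)
assert r * c_1 + (1 - r) * c_2 > v_H        # (3.2)

sigma_R = {'L': v_L, 'H': v_H}              # the seller's equilibrium pricing rule

def y_R(p):
    # the rational buyer purchases iff the price is at most v_L
    return 1 if p <= v_L else -1

# trade occurs only when theta = L, so the equilibrium is inefficient
assert y_R(sigma_R['L']) == 1 and y_R(sigma_R['H']) == -1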
3.4. Introducing Time.

Our question of interest is whether the receiver can learn the rational label y^R if the stage game is repeated over time. As an intermediate step toward defining algorithm games, we describe our approach and the assumptions involved with this step. In the next section, we discuss the algorithm choice that occurs on top of this.

By expanded stage game, we refer to a repetition of the stage game interaction, played over discrete time t = 1, 2, . . ., where the stage game interactions described previously occur at every t ≥ 1.
Let (σ_t, r_t) be the pair of strategies of the sender and the receiver in period t. The true state θ is drawn i.i.d. across periods according to π, and the pair (p_t, a_t) of the price and the action in period t is selected by (σ_t, r_t). In this case, the expected payoff of the sender is

lim_{T→∞} E (1/T) Σ_{t=1}^{T} u(θ_t, p_t, a_t) π(θ_t) σ_t(p_t : θ_t) r_t(a_t : p_t)    (3.3)

and the expected payoff of the receiver is

lim_{T→∞} E (1/T) Σ_{t=1}^{T} v(θ_t, p_t, a_t) π(θ_t) σ_t(p_t : θ_t) r_t(a_t : p_t).    (3.4)

[Footnote: For now, we are intentionally vague about the strategy space of each player in the expanded stage game, as this is described in the next section.]

[Footnote: Our results are most elegantly stated in the undiscounted limit. Prior versions of this paper considered the case where future payoffs were discounted at rate δ < 1; the main lessons remain valid for δ sufficiently large, although there are some added technical difficulties in the analysis of Section 5.3.3 this introduces.]
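A minimal sketch of the expanded stage game, reusing the hypothetical play_stage_game helper from Section 3.1, makes the undiscounted objectives (3.3)–(3.4) concrete. Here tau stands for any map from histories to receiver rules; the data layout is an assumption of ours.

def run_expanded_game(T, prior, sigma, tau, u, v):
    # theta is drawn i.i.d. each period; sigma is committed once and for all;
    # the receiver's period-t rule is produced by tau from the dataset D_t.
    D, total_u, total_v = [], 0.0, 0.0
    for t in range(T):
        rule = tau(D)                                           # r_t = tau(D_t)
        theta, p, a, us, vs = play_stage_game(prior, sigma, rule, u, v)
        D.append((p, a, vs))                # sender action and ex-post payoff recorded
        total_u += us
        total_v += vs
    return total_u / T, total_v / T         # finite-T analogues of (3.3) and (3.4)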
4. Algorithm Game

Having outlined the basic timing of moves, we now describe the "super" game which determines the player's strategy in the expanded stage game. We refer to this as an algorithm game. Throughout this paper, we assume that the sender (principal) is fully rational, but the strategic choice of the receiver (agent) must be delegated to an algorithm.
4.1. Choices of Algorithms.
We will refer to the strategy a receiver uses—which is output by the algorithm in each period—as a classifier, in line with the machine learning and computer science literature:
Definition 4.1.
A classifier is a function γ : P → A. This may additionally be referred to as either a strategy or a forecasting rule.
In order to construct the classifier, the algorithm faces some computational constraints. More precisely, we assume that there is a fixed set of classifiers H (referred to as the hypothesis class) for which the algorithm can solve the following problem:

min_{h ∈ H} Σ_p 1{h(p) ≠ y(p)} L(p),    (4.5)

for an arbitrary function L and function y : P → A. We refer to this step as finding the best fitting hypothesis. We can think of L as being the cost of misclassifying a particular observation, which may vary. Note that, since we can add arbitrary constants to L and normalize so that it sums to 1 over all p, it is equivalent to assume the algorithm can solve

max_{h ∈ H} Σ_p 1{h(p) = y(p)} d(p),    (4.6)

for a probability distribution d over p. This provides an alternative interpretation, regarding the classifier seeking to make the correct guess with the highest possible probability. We treat the process of finding the best fitting hypothesis as a black box. The purpose of this paper, however, is to understand how the algorithm designer might benefit from additional capabilities, and across a variety of environments. One question is which kinds of additional capabilities are necessary. The main ones we will discuss are:

• Constructing labels based on observations,
• Creating classifiers derived from solutions to the above maximization,
• Changing observations of p_t to p̂_t, if the data is generated by a randomized rule.

One hypothesis class is of particular interest. Let H(λ, ω) be a hyperplane in R^n:

∃λ ∈ R^n and ω ∈ R such that H(λ, ω) = {p ∈ R^n : λp = ω}.

Define H⁺(λ, ω) as the closed half space above H(λ, ω):

H⁺(λ, ω) = {p ∈ R^n : λp ≥ ω}.
Definition 4.2.
A single threshold (linear) classifier is a mapping h : P → A where ∃a₊, a₋ ∈ A, λ ∈ R^n and ω ∈ R such that

h(p) = a₊ if p ∈ H⁺(λ, ω),
h(p) = a₋ if p ∉ H⁺(λ, ω).
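For scalar p and A = {−1, 1}, the best fitting single-threshold classifier in the sense of (4.6) can be found by brute force over candidate thresholds. This is only an illustrative sketch; the function name and data layout are our own assumptions.

def fit_single_threshold(data, d):
    # data: list of (p, y) with y in {-1, +1}; d: weights over observations summing to one.
    # Maximizes sum_p 1{h(p) = y(p)} d(p) over rules h(p) = a_plus if p >= omega else a_minus.
    best = None
    for omega in sorted({p for p, _ in data}) + [float('inf')]:
        for a_plus, a_minus in [(1, -1), (-1, 1), (1, 1), (-1, -1)]:
            score = sum(w for (p, y), w in zip(data, d)
                        if (a_plus if p >= omega else a_minus) == y)
            if best is None or score > best[0]:
                best = (score, omega, a_plus, a_minus)
    return best   # (weighted accuracy, threshold, a_plus, a_minus)

Note that the constant ("trivial") classifiers are included via the pairs (1, 1) and (−1, −1).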
Definition 4.3. Let Γ be the set of all classifiers, and Γ̃ ⊂ Γ denote a subset of classifiers. A statistical procedure or algorithm is an onto function

τ : D → Γ̃,

where D is a set of histories. T is the set of feasible algorithms (i.e., a subset of the set of functions from D into Γ̃).

What D consists of is very much problem specific. In a typical learning model, we assume that the receiver observes the realized outcome (p_t, a_t) in period t but also can access some information about the performance of his choice a_t toward achieving his goal. For example, if the goal of the receiver is to learn the rational label y^R, a natural candidate would be a sufficient statistic of the ex-post payoff v(θ_t, p_t, a_t).

One specification of T emerges from not having any restrictions on Γ̃ at all. In general, the set Γ̃ will be implicit in the description of the algorithm. Our main interest is in understanding which kinds of T enable the receiver to approximate the rational label y^R.

4.2. Timing and Objectives.
An algorithm game takes the interaction in the stage game as a starting point, and considers the outcome when, instead of having the receiver's strategy emerge from Bayesian rationality, it instead emerges from fitting a model to past observations.

An algorithm game is a simultaneous move game under asymmetric information between the (rational) sender and the boundedly rational ("algorithmic") receiver, built on the "expanded" stage game.

A0. According to some prior distribution, nature selects the distribution π of the underlying game from Π, where Π is a subset of probability distributions over Θ with finite support.
A1. Conditioned on realized π, the sender commits to some strategy σ. The receiver (or alternatively, an entity acting on the receiver's behalf) commits to an algorithm τ ∈ T without observing the realized π ∈ Π.
A2. The expanded stage game is played, with the receiver's strategy in each period t being τ(D_t)(p) (i.e., the action specified by the algorithm following sender's action p at time t), with the algorithm adding the observation (which includes p and ex-post utility following each receiver action) to the dataset at the end of each period.

These actions determine the realized payoffs of each player, as described in the previous section. Notice that we do not necessarily assume that any pair π₁, π₂ ∈ Π have intersecting or even overlapping support (though this is also certainly allowed). Correspondingly, we emphasize we do not assume Θ is itself finite, even though all π ∈ Π have finite support.
Additionally, since we only assume the algorithm observes the receiver's ex-post utility, θ itself need never be observed by the algorithm. For instance, θ may reflect a production cost which only influences the sender's payoff. In this case, while the algorithm would observe the receiver's payoff from each action, it would not observe the seller's cost.

The sender chooses σ once and for all, with the action p_t drawn i.i.d. over time. On the other hand, the receiver's strategy in period t is determined by the algorithm τ and the history D_t in period t. The expressions for the payoffs of the sender and the receiver are (3.3) and (3.4), respectively, where r_t(· : p) is given by τ(D_t)(p).

We consider the objectives of the rational player and the algorithmic player separately. The former is straightforward; given a sequence (θ_t, p_t, a_t), the rational player's payoff (i.e., the sender's payoff) function is simply the long run average expected payoff:

U_s(σ, τ) = lim_{T→∞} (1/T) Σ_{t=1}^{T} E u(θ_t, p_t, a_t),

where (p_t, a_t) is generated by (σ, τ) in period t and the expectation is otherwise conditioned only on π ∈ Π (recalling that θ is taken to be drawn i.i.d.). The objective of the rational sender is to maximize U_s by choosing σ, conditioned on π ∈ Π. The payoff function of the algorithmic player is also the long-run average expected payoff:

U_r(σ, τ) = lim_{T→∞} (1/T) Σ_{t=1}^{T} E v(θ_t, p_t, a_t).

Note that, implicitly, the players do not discount future payoffs in the algorithm game.

We are interested in comparing the outcomes induced by the algorithm and the rational label y^R(σ, p) introduced in the last section. We note that the comparison is potentially unfair because algorithms are more constrained in the decision rules that can be used. We therefore introduce a notion of rationality reflecting these limits:

Definition 4.4.
An algorithm τ is constrained rational if, ∀ε, δ > 0, ∀σ, ∃T such that ∀t ≥ T,

P( Σ_{θ ∈ supp π} v(θ, p, τ(D_t)(p)) σ(p : θ) π(θ) ≥ max_{h ∈ Γ̃} Σ_{θ ∈ supp π} v(θ, p, h(p)) σ(p : θ) π(θ) − ε ) ≥ 1 − δ,

with the probability referring to uncertainty over D_t. An algorithm τ is fully rational if Γ̃ is replaced by the set of all h : P → A.

The "constrained" qualifier is due to the limits on the strategies that can be chosen by the receiver. A fully rational receiver would choose y^R(σ, p) in the stage game; a constrained rational algorithm yields actions that are as close to optimal as possible, given that its output must lie within the expanded model class Γ̃. We often regard γ ∈ Γ̃ as a forecasting rule and τ as a formal procedure to construct a (strong) forecasting rule.

In later sections, we will also discuss an important performance criterion of an algorithm: PAC (Probably Approximately Correct) learnability (Shalev-Shwartz and Ben-David (2014)).
Definition 4.5.
Algorithm τ is PAC (Probably Approximately Correct) learnable if ∀σ ∈ Σ, ∀ε > 0, ∃T such that ∀t ≥ T,

P( τ(D_t)(p) ≠ y^R(σ, p) ) < ε,

with the probability referring to uncertainty over D_t and the realized p.

A key difference between Definitions 4.4 and 4.5 is that the latter is a condition on the actions themselves and the decision rule, while the former is a condition on utility. In order to learn the equilibrium outcome y^R(σ^R, ·), σ^R must be a best response to the decision rule induced by the algorithm in the long run.

Definition 4.6.
An outcome (σ, τ) of the algorithm game emulates (σ^R, y^R) of the underlying stage game if σ = σ^R and τ is fully rational.

The substance of the definition is that σ^R is a best response to τ. Then, along the equilibrium path of the algorithm game, the receiver behaves as if he perfectly foresees σ^R and responds optimally subject to the feasibility constraint imposed by Γ̃.

4.3. Specifying T.

Our main interest will be in the case where T is restricted to emerge as the outcome of an ensemble algorithm.

Definition 4.7.
Classifier H is an ensemble of H if ∃h₁, . . . , h_K ∈ H and α₁, . . . , α_K ≥ 0 such that

H(σ, p) = arg max_a Σ_{k=1}^{K} α_k 1{a = h_k(σ, p)}.

Without loss of generality, we can assume that Σ_{k=1}^{K} α_k = 1, since if not we can simply divide by this sum and obtain the same classifier. We can interpret H as a weighted majority vote of h₁, . . . , h_K (a code sketch of this vote appears at the end of this subsection). An ensemble algorithm constructs a classifier through a linear combination of classifiers from H. Since the final classifier is constructed through a basic arithmetic operation, one can easily construct an elaborate classifier from rudimentary classifiers. Ensemble algorithms have been remarkably successful in real world applications (Dietterich (2000)).

The algorithms produce an output ensemble classifier according to a recursive scheme:

• First, the loss function in (4.5), say L₁, or probability distribution in (4.6), say d₁, is taken to treat all observed sender actions symmetrically—that is, L₁(p) = d₁(p) = 1/G, where G is the number of elements in the support of the mixed strategy σ.
• At each stage k = 1, . . ., the best fitting hypothesis is found by solving either (4.5) or (4.6). The best fitting hypothesis is referred to as h_k.
• The term α_k is then determined, possibly as a function of the objective of the best fitting hypothesis.
• Depending on h_k and α_k, the loss function L_k is updated to L_{k+1} (or, in the case of distributions, d_k is updated to d_{k+1}).
• After repeating this iteration K times, a classifier of the form of Definition 4.7 is output, which is used to determine the final choice of the receiver.

The ability to use an ensemble algorithm allows additional richness in the set of classifiers that can be used. There remain, however, a number of challenges:

• Clearly, repeatedly solving the same problem will not yield different outcomes, and so to meaningfully expand H one needs to determine how to change the objective to be fit as well, and
• Weights must be specified in advance.

Both of these are on top of the need to potentially alter the observed p_t and to determine the labels y_t(σ, p_t) to use for the observations, since the observed utility-maximizing decision need not coincide with the rational one ex-post.
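The weighted majority vote of Definition 4.7 is simple to state in code; this is a minimal sketch with hypothetical names (the full recursive scheme is instantiated in Section 5.1.2).

from collections import defaultdict

def ensemble_vote(hs, alphas, p):
    # H(p) = argmax_a sum_k alpha_k 1{a = h_k(p)}: a weighted majority vote
    tally = defaultdict(float)
    for h, alpha in zip(hs, alphas):
        tally[h(p)] += alpha
    return max(tally, key=tally.get)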
Remark 4.8. The reader may still wonder why algorithm design is necessary in the first place. For instance, if y^R(σ, p) is a single threshold rule, it may be surprising that simply fitting the optimal single threshold rule to the data is insufficient to emulate rationality. While it may be sufficient in some cases, it is not in general; in particular, the ability to emulate rationality does not follow from the rational reply being in H. The reason is that it is necessary to be able to construct richer rules in order to deter the sender from deviating to exploit limitations in H, which would prevent the receiver from choosing the rational reply "off-path." This is articulated in Section 5.3.2, where we also clarify the role of taking a richer set of possible Π, to correspondingly justify choosing a sufficiently rich set H to begin with.

5. Main Results
We now present our main results, showing the existence of an equilibrium of the algorithm game where the rational reply is emulated. We begin with a preliminary observation, useful for understanding our subsequent analysis: PAC learnability is a sufficient condition for the algorithm game to have a Nash equilibrium emulating (σ^R, y^R(σ^R, p)).

Proposition 5.1. If τ is PAC learnable, then (σ^R, τ) is a Nash equilibrium of the algorithm game which emulates (σ^R, y^R(σ^R, p)).

Proof. If τ is PAC learnable, then the receiver learns σ accurately in the long run. Thus, the long run average expected payoff of the sender is

U(σ, τ) = E_θ u(θ, σ, y^R(σ, σ(θ))).

By definition,

σ^R = arg max_σ E_θ u(θ, σ, y^R(σ, σ(θ))).

By PAC learnability,

lim_{t→∞} P[ τ(D_t)(p) ≠ y^R(σ^R, p) ] = 0,

implying that E[v(θ_t, p_t, a_t)] → E[v(θ_t, p_t, y^R(σ^R, p_t))] as t → ∞. This implies the long run payoffs are equal to those obtained against a rational player, and hence (σ^R, τ) constitutes a Nash equilibrium which emulates (σ^R, y^R(σ^R, p)). □

This observation suggests it suffices to show the PAC-condition holds; in that case, the sender would find it optimal to choose σ^R, and by definition it would not be possible for the receiver to outperform rationality. However, there are two main difficulties which we seek to emphasize:

(1) First, it may be that y^R ∈ H, and yet if H is limited then the rational outcome cannot be emulated without expanding the set of feasible decision rules, and
(2) Second, one still needs to specify how the algorithm should use the historical data in inferring the correct decision.

This section addresses each of these issues. We first consider the case where the receiver knows the values of y^R(σ, p) ∀(σ, p). We then show, in Section 5.1, that the PAC-condition holds for an algorithm:

Proposition 5.2.
If the receiver knows the values of y^R(σ, p) ∀(σ, p), there exists an algorithm τ_A that is PAC learnable. Thus, (σ^R, τ_A) is a Nash equilibrium of the algorithm game, which emulates (σ^R, y^R).

We then turn to the case where the algorithm cannot observe y^R(σ, p). This yields an algorithm τ_Â, which coincides with τ_A with the added step of inferring the labels. We show that we obtain an analogous result for this case in Section 5.2:

Proposition 5.3.
Suppose that y^R(σ, p) is a strict best response ∀σ but the receiver does not observe the values of y^R(σ, p). Then, there exists an algorithm τ_Â that is PAC learnable, and (σ^R, τ_Â) is a Nash equilibrium of the algorithm game that emulates (σ^R, y^R).

In our analysis, the first step is to construct an algorithm that generates an accurate forecast in the long run. The remaining step is to show that the sender has an incentive to choose σ^R against τ_Â in the algorithm game, in Section 6.

5.1. Specifying the Algorithm and Weak Learnability (Proposition 5.2).
5.1.1. Weak Learnability.
The sufficient condition which ensures we can approximate an arbitrary decision rule by combining single-thresholds is weak learnability. Roughly speaking, weak learnability says that the hypothesis class can outperform someone who had some very minimal knowledge of the truth of the hypothesis. That is, it must be that the hypothesis class can do better than someone who made a random guess, which would be made correct with some arbitrarily small probability. While this may seem permissive—and indeed, it is certainly less stringent than requiring that the class can approximate the truth with high probability—the difficulty in achieving it is the fact that this guarantee must be uniform over all possible distributions. We formally define this as follows:
Definition 5.4.
Let P(σ) be the support of σ. If h* solves

Σ_{p ∈ P(σ)} d(p) 1{y(p) = h*(p)} ≥ Σ_{p ∈ P(σ)} d(p) 1{y(p) = h(p)}  ∀h ∈ H,

then h* is an optimal weak hypothesis.

Definition 5.5. If |A| = 2, a hypothesis class H is weakly learnable if, for every distribution d over observations p ∈ P(σ) and labels y(p), the optimal weak hypothesis h* satisfies

Σ_{p ∈ P(σ)} d(p) ( 1{y(σ, p) = h*(p)} − 1{y(σ, p) ≠ h*(p)} ) ≥ ρ,

for some ρ > 0. If |A| > 2, a hypothesis class H is weakly learnable if, for every distribution d over observations p ∈ P(σ) and labels y(p), the optimal weak hypothesis satisfies

Σ_{p ∈ P(σ)} 1{h*(p) ≠ y(p)} d(p) ≤ Σ_{p ∈ P(σ)} E_{ỹ ∼ B}[ (1 − ρ) 1{ỹ ≠ y(p)} ] d(p),

for some ρ > 0 and some distribution B over A.

The second condition is a generalization of the first, though the first is perhaps more familiar from the machine learning literature (as most attention has focused on the two-label case). This condition reflects the idea that the classifier randomly guesses the label according to some distribution B, but is "flipped to being correct" with probability ρ. For the |A| > 2 case, if this condition fails, no recursive ensemble algorithm can be built to approximate y(p) based on H alone. Perhaps more surprising is that it is tight, a fact which we discuss further in Section 5.3.1. For now, we simply mention that the set of single-threshold classifiers is weakly learnable.
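For |A| = 2, the quantity bounded below by ρ in Definition 5.5 is the edge of the best fitting classifier over a fair coin. Reusing the hypothetical fit_single_threshold sketch from Section 4.1, it can be computed as follows; weak learnability asks that this edge stay at least ρ > 0 for every distribution d.

def edge(data, d):
    # margin sum_p d(p)(1{y(p) = h*(p)} - 1{y(p) != h*(p)}) of the optimal weak hypothesis
    accuracy, _, _, _ = fit_single_threshold(data, d)
    return accuracy - (1.0 - accuracy)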
Proposition 5.6. The set of single-threshold classifiers satisfies the weak learnability condition of Definition 5.5.

Proof.
See Appendix A. □
Our proof uses the following important fact: any hypothesis class that contains all label permutations can at least match the random guess guarantee. The proof of this intermediate lemma uses a duality argument in order to show that no distribution can lead to a lower payoff when this condition is satisfied. Importantly, however, this is true for any hypothesis class, including the trivial one. [Footnote: For example, imagine H only consists of trivial classifiers. A corollary of a result in Appendix A.1 is that these classifiers can do equally well as a random guesser. However, it is clear that they cannot do strictly better, as they are restricted to giving the same guess to all possible p, unlike a random guesser who is correct with an added probability ρ.] This observation allows us to show that the added richness of single-threshold classifiers is sufficient to provide the additional gain over random guessing.

5.1.2. From Weak Learnability to Decision Rules.

For simplicity, we present the case where A = {−1, 1}, leaving the general case to the appendix. The formal description of the algorithm takes two steps. First, we describe an algorithm under the assumption that the receiver knows the value of y^R(σ, p) ∀p. If A contains two elements, the specification of the algorithm parameters coincides with the Adaptive Boosting algorithm τ_A of Schapire and Freund (2012). We first outline the parameters and then review, for completeness.

The kth stage (initializing with the uniform distribution if k = 1) starts with a probability distribution d_k(p) over the support of σ. Define

ε_k = P_{d_k}( h_k(p) ≠ y^R(σ, p) )    (5.7)

as the probability that the optimal classifier h_k at stage k misclassifies p under d_k. If ε_k = 0, then we stop the training and output h_k as the forecasting rule, which perfectly forecasts y^R(σ, p).
0. Define α k = 12 log 1 − ǫ k ǫ k (5.8)The weak learnability of the single threshold rule implies that ∃ ρ > ǫ k ≤ − ρ ∀ k ≥ . Define for each p in the support of σ , and each pair ( p, y R ( σ, p )), d k +1 ( p ) = d k ( p ) exp( − α k y R ( σ, p ) h k ( p )) Z k where Z k = X p d k ( p ) exp( − α k y R ( σ, p ) h k ( p )) . Given d k +1 , we can recursively define h k +1 and ǫ k +1 , both of which are functions of d k +1 as per the above.The decision of the receiver is based upon τ A ( D k )( p ) = arg max a ∈ A k X t =1 α t ( h t ( p ) = a )which is equivalent to τ A ( D k )( p ) = sgn k X t =1 α t h t ( p ) if A = {− , } , where sgn ( x ) is the sign of real number x .Following Schapire and Freund (2012), we can show that P (cid:16) τ A ( D t )( p ) = y R ( σ, p ) (cid:17) ≥ − e − tρ ( G ) (5.9)for any mixed strategy σ , where G is the number of elements in the support of σ . Inferring the Rational Label (Proposition 5.3).
Next, we drop the assumptionthat the receiver can observe σ so that he can calculate the expected utility conditionedon p in the support of σ : X θ v ( θ, p, a ) µ ( θ : p ) (5.10)where µ is computed via Bayes rule, and therefore, knows the value of y R ( σ, p ). If thereceiver does now know y R ( σ, p ), then he cannot calculate ǫ k in (5.7). We need to constructan estimator ˆ y t ( p ) for y R ( σ, p ) from data D t available at the beginning of period t . Howwe construct estimator ˆ y t ( p ) depends upon the specific details of the rule of the game A sketch of the proof is in Appendix B. such as the available data and the variable of interest. We require that ˆ y t ( p ) satisfies aregularity property. Definition 5.7. ˆ y t ( p ) is a consistent estimator if ˆ y t ( p ) converges to y R ( σ, p ) in probabilityas t → ∞ . We require that ˆ y t ( p ) satisfies the large deviation property (LDP), which is a strongerproperty than consistency. Definition 5.8. ˆ y t ( p ) satisfies large deviation properties (LDP) if ∃ λ > such that, ∀ p in the support of σ , lim sup t →∞ − t log P (cid:16) y R ( σ, p ) = ˆ y t ( p ) (cid:17) ≤ λ. (5.11)If an estimator satisfies LDP, the tail portion of the forecating error vanishes at theexponential rate, as the sample average of i.i.d. random variables converges to the popula-tion mean. If an estimator fails to satisfy LDP, the finite sample property of the estimatortends to be extremely erratic (Meyn (2007)). Most estimators in economics satisfy LDP.In the three examples illustrated in Section 3.3, the variable of interest is the probabilitydistribution of the underlying valuation conditioned on p ∈ P . Let π ( v : p ) be the posteriordistribution of v conditioned on p . If v is drawn from a finite set, then π ( v : p ) is amultinomial distribution. Let ˆ π t ( v : p ) be the sample average for π ( v : p ). We know thatthe rate function of ˆ π t ( v : p ) is the relative entropy of ˆ π t with respect to π (Dembo andZeitouni (1998)) I π = X v ˆ π t ( v : p ) log ˆ π t ( v : p ) π ( v : p ) , from which we derive λ in (5.11): ∀ ǫ >
0, let N ǫ ( π ) be the ǫ neighborhood of π ( v : p ), and λ = inf ˆ π t N ǫ ( π ) I π . Note that y R ( σ, p ) = ˆ y t ( p )only if π and ˆ π t prescribe differen actions. Since ˆ π t is a consistent estimator of π , theprobability of two probability distributions prescribing two different actions vanishes. Thelarge deviation property of ˆ π t implies that ˆ y t ( p ) satisfies (5.11), if y R ( σ, p ) is a strict bestresponse.By the concavity of the logarithmic function, I π is minimized if π is a uniform distri-bution and inf π I π > . If | P | < ∞ and | A | < ∞ , we obtain the uniform version of (5.11) with respect to the trueprobability distribution. We state the result without proof for later reference. Lemma 5.9.
Suppose that ˆ y t ( p ) is consistent and satisfies (5.11) . Then, ∃ λ > suchthat lim sup t →∞ − t log P (cid:16) y R ( σ, p ) = ˆ y t ( p ) ∀ p in the support of σ (cid:17) ≤ λ. (5.12) ACHINE LEARNING 21
We construct algorithm τ ˆ A by replacing y ( σ, p ) by ˆ y t ( p ) in τ A constructed in the previoussection. More precisely, let f yt ( p ) be the empirical probability that ˆ y t ( p ) = 1 at thebeginning of period t . Thus, ˆ y t ( p ) = − − f yt ( p ). Given { d t ( p ) , ˆ y t ( p ) } p , h t solves max h ∈H X p h ( p ) d t ( p )[1 · f yt ( p ) − · (1 − f yt ( p ))]and ˆ ǫ t = X p d t ( p ) (cid:2) f yt ( p ) ( h ( p ) = 1) + (1 − f yt ( p )) ( h ( p ) = − (cid:3) . Using weak learnability, we can show that ∃ ρ > ǫ t ≤ − ρ. Since ˆ y t ( p ) has the full support over {− , } ∀ t ≥ ǫ t > . Given an algorithm τ A with observed labels, we can therefore replace it with τ ˆ A whichinvolves inferring the labels y R ( σ, · ), setting them equal to ˆ y t ( · ), for all t ≥ τ ˆ A , we can construct labels from data, and that for the hypothesis class of interestthe weak learnability condition is satisfied. The last step to show the algorithm works, inthe case where the set of possible p has finite support, is that the output of the algorithmwill indeed converge to the rational reply, as dictated by the labels, provided the weightsare specified correctly. Proposition 5.10.
Suppose that ˆ y t satisfies uniform LDP and that y R ( σ, p ) is a strictbest response ∀ p . Then, ∀ σ that randomizes over G elements of P , ∃ T and ∃ ρ ( G ) > such that P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) ∀ t ≥ T (cid:17) ≥ − e − tρ ( G ) . Proof.
See Appendix B. (cid:3)
The construction of ˆ y t depends on the specifics of a problem, especially what data D t available in period t contains. In many interesting economic models, the algorithm for ˆ y t needs the knowledge of Θ rather than simply the support of π , and D t can contain at leastthe ordinal information about the performance of the decision recommended by τ ˆ A .Let us consider the insurance model illustrated in Section 3.3.1. The critical value is(5.10). Instead, suppose that the receiver can observe the average performance differenceof two actions: sgn t − X k =1 v ( θ k , p, − v ( θ k , p, (5.13) in the past. That is,ˆ y t ( p ) = ( P t − t ′ =1 v ( θ, p, − v ( θ, p, . ≥ − P t − t ′ =1 v ( θ, p, − v ( θ, p, . < . Given a probability distribution over y R ( σ, p ), ˆ y t ( p ) satisfies LDP: ∃ λ > t →∞ − t log P (cid:16) ˆ y t ( p ) = y R ( σ, p ) (cid:17) ≤ λ. We know that the large deviation rate function over a binominal distribution is uniformlybounded from below (Dembo and Zeitouni (1998)). Thus, we can choose λ > y R ( σ, p ).The ordinal information (5.13) about the average quality is necessary. Without accessto (5.13), the algorithm cannot estimate y ( σ, p ), which is critical for emulating the rationalbehavior. The information contained in (5.13) is coarse, because the algorithm does nottake any cardinal information about the parameters of the underlying game. Withoutthe cardinal information, the receiver cannot implement the equilibrium strategy of thebaseline game, which is a single threshold rule. Because the algorithm does not rely onparameter values of the underlying game, the algorithm is robust against specific detailsof the game, if the algorithm can function as intended by the decision maker.5.3. Discussion.
Accommodating Multiple Actions.
We use the Adaptive Boosting algorithm, as in-troduced by Schapire and Freund (2012), to specify the α k weights and the updates if | A | = 2. The original Adaptive Boosting algorithm only applies to the case of | A | = 2. Tohandle the case of | A | >
2, we appeal to a generalization introduced by Mukherjee andSchapire (2013).The | A | > | A | = 2 case, with one minor drawback, whichis that the learnability constant must be computed in advance. While our work showsan algorithm exists, the computation of the learnability constant is more indirect andhence explicitly finding a parameter that works is more difficult. The arguments for theseproofs follow from results in the machine learning literature (see Schapire and Freund(2012)), which we can apply to show that this algorithm can yield a response for whichthe misclassification probability vanishes.The proof of Proposition 5.10 is stated for the general case. The proof reveals thatthe rate at which the probability of misclassfication vanishes is determined entirely by thenumber of sender actions in the support of σ . Thus, the algorithm is efficient in that itmaintains an exponential rate of convergence (Shalev-Shwartz and Ben-David (2014)). At the end of each period, the receiver is supposed observe the performance difference. If not, wecan devise an experimentation strategy to infer the average performance difference following the idea ofexploration and exploitation.
ACHINE LEARNING 23
On the Necessity of Expanding H . So far, our analysis has assumed that the set ofinitial classifiers H contains the set of single-threshold classifiers, we have shown that therational reply can be guaranteed by an algorithm. We now show that this result requiresthe ability to construct algorithms to expand the set of possible decision rules, even if y R ( σ, p ) ∈ H , given sufficient richness in the set Π—recall that we do not necessarilyassume | Θ | < ∞ , so that different π with non-overlapping support may be possible; thatis, if the designer seeks to provide rational replies in a variety of different environments,one cannot simply find the best fitting hypothesis within H to emulate rationality.Indeed, it is straightforward to find conditions on Π, the set of possible distributionsover Θ (all of which, we assume, have finite support), such that the optimal decision ruleis of the threshold form. In fact, the algorithm designer may improve upon the rationalreply given knowledge of π . To illustrate, suppose u ( θ, p, a ) is independent of θ , and weaklyincreasing (coordinatewise) in p , for each a (the latter of which would hold if, for instance, p were a menu of prices). The following simple result shows that in this case, at least(increasing) single threshold classifiers should be included: Proposition 5.11.
Suppose u ( θ, p, − u ( θ, p, is constant in θ and weakly concave in p .Suppose further that u ( θ, p ∗ , > u ( θ, p ∗ , . Then there exists a single threshold classifierwhich the algorithm could commit to using which ensures the strategic player chooses p ∗ with probability 1. The proof follows immediately from an observation that the set of p at which the consumerchooses a = 1 is convex under the conditions of the proposition. In order for the algorithm designer to improve upon a degenerate prescription to alwayschoose a = 0, Proposition 5.11 suggests including at least single threshold classifiers whichare increasing. Against the highlighted π , such prescriptions would give the receiver evenhigher commitment power than the rational benchmark. In order to maximize payoffagainst richer and richer Π, more and more classifiers should therefore be included to H .This raises the question of whether adding in these classifiers goes “too far.” Namely,in seeking to maximize payoff against a rich set of possible π , does this risk doing worse against others? In fact, it may be that the receiver does worse than the rational bench-mark. Proposition 5.12.
Suppose that ˜Γ is the set of all single threshold classifiers, and supposeall π ∈ Π has binary support. For any { θ L , θ H } supporting π , suppose the following issatisfied: • The sender’s optimal p − a pair when θ = θ L is ( p ∗ L , • The sender’s optimal p − a pair when θ = θ H is ( p ∗ H , , with p ∗ L < p ∗ H . • v ( θ H , p ∗ H ,
1) = v ( θ H , p ∗ H , , • v ( θ L , p ∗ L , ≥ v ( θ L , p ∗ L , , and • v ( θ, p, − v ( θ, p, increasing in p , for all θ .Then a policy arbitrarily close to the sender’s optimal p − a pair is implementable, evenif this differs from the rational outcome under σ R . See also Gilboa and Samet (1989) for a similar observation on how the use of restricted decision rulescan be advantageous.
A setting where this sender-optimal action strategy differs from the rational outcomewas first studied, to the best of our knowledge, in Rubinstein (1993). His setting satisfiesthe conditions of the proposition. Our proof adapts his arguments to the current setting(i.e., incorporating the statistical aspect of our exercise and beyond the application heconsidered, described above), and highlights the importance of counterveiling incentivsein driving the result. The reason the sender can profitably deviate in the previous proofis because the new σ induces a non-monotone response from the receiver optimally, eventhough this is not prescribed by σ R . In contrast, decision rules with single-thresholdclassifiers must be monotone. In other words, we show that the sender can construct a strategy which ensures that thereceiver’s utility as a function of p violates single-crossing. Now, if the sender were usingthe particular σ ∗ from the previous proof, then the rational response can be achieved viaa double-threshold classifier, since there are only three optimal sender choices. But on theother hand, if the receiver were restricted to using single- or double-threshold classifiers,then one could find another strategy whereby the optimal response would be to use atriple-threshold classifier, via a similar scheme. As long as the number of thresholds usedis finite, a similar kind of exploitation would emerge.5.3.3. Accommodating Richer Principal Action Spaces.
While τ ˆ A is designed to be robustagainst parametric details of the underlying problems, the algorithm is still vulnerableto strategic manipulation by the rational sender. The proof of Proposition 5.10 revealsthat the rate of convergence is decreasing as the number of sender actions in the supportof σ increases. The sender can randomize over infinitely many messages to slow downthe convergence rate arbitrarily. That said, such manipulation would be short lived, andtherefore have limited gains. Nevertheless, in order to ensure that there are only a finitenumber of observations that the algorithm may observe, it is necessary to augment theobservation space so that the distribution facing the receiver can be treated as discrete.This section discusses how this modification can be done. Approach One: Discretization
We describe how to revise τ ˆ A accordingly to discretizethe observation space. Instead of processing individual actions, we let τ ˆ A process a groupof actions at a time, treating “close” actions as the same group. In principle, we want topartition P into a set of half-open rectangles intervals with size λ . More precisely, givensome arbitrary λ , we can partition each dimension of a rectangle containing P into thecollection of half open intervals of size λ > P j = [ p , p + λ ) , . . . , P jK λj = [ p + ( K λj − λ ) , p ]where K λj is the number of elements in the partition and j ∈ { , . . . , n } is a partiulardimension.For each element in the partition, the algorithm receives an ordinal information aboutthe average outcome from the decision, if it contains a sender action in the support of σ :ˆ y λt ( k ) = a if a = arg max X p ∈ P k v ( θ, p, a ) Even though y R ( σ R , p ) is an element of ˜Γ, y R ( σ ∗ , p ) is not an element of ˜Γ. Since σ ∗ is the choicevariable of the sender, the sender generates misspecification endogenously. ACHINE LEARNING 25 where p in the support of σ and P k is the product of partition elements. Let τ λ ˆ A be thealgorithm obtained by replacing ˆ y t ( p ) in τ ˆ A by ˆ y λt ( k ). Note that as λ →
0, the size of theindividual elements in the partition shrinks and τ λ ˆ A converges to τ ˆ A for a fixed σ .Compared to τ A and τ ˆ A , τ λ ˆ A takes only coarse information for two important reasons.First, the algorithm cannot differentiate two p s which are very close. This features makesthe algorithm robust against strategic manipulation of the sender to slow down the speedof learning. Second, the algorithm cannot detect the precise consequence of its decision,but only the ordinal information of the past decision, aggregated over time. The secondfeature allows the algorithm to operate with very little information about the details ofthe parameters of the underlying game. Approach Two: Smoothing
Discretizing the action space as above is one way of ensuringthat there are only a finite number of sender actions to worry about in the long run, andgiven a sufficiently fine discretization, any distinct p is distinguished by the algorithm.However, in principle, close sender actions may still be quite far in terms of payoffs, andonly be distinguished in the long run. That is, there is no guarnatee that for a fixedhorizon, that the algorithm is not grouping too many p possibilities. The issue is thatthe discretization approach uses no information about the receiver’s payoff function. Ourother alternative describes more explicitly how close to rationality the receiver can achive,given some fixed discretization scheme.The idea is the following: We add a small amount of noise to each observed p , with theamount of noise tending to 0 as the sample size grows large. Doing so allows us to show thatthe receiver perceives the sender’s strategy to have the property that E θ [ u ( a, θ, p ( a ) : p ]is uniformly equicontinuous (as functions of p ). As a result, if the receiver only seeks touse a strategy that is ε − optimal against σ , uniform equicontinuity implies that their bestreply can essentially be collapsed within intervals.It will additionally be important that the algorithm does not seek to make predictionsat p values where the corresponding density would be estimated to be small. Hence asecond step will be to determine whether a p realization occur in a region with sufficientlylarge probability, where the “sufficient” amount will also tend to 0 as the amount of datagrows large.Formally, suppose the algorithm observes data (( p , y ) , . . . , ( p n , y )). Let z η,i be an in-dependent random vector in the unit ball around 0 distributed according to the PDF: φ η ( z ) = 1 K exp − − (cid:12)(cid:12) z/η (cid:12)(cid:12) η | A |− , where K is a constant which ensures φ η integrates to 1. Our first augmentation is thefollowing: • Replace the observed p , . . . , p n with ˆ p , . . . , ˆ p n , where ˆ p i = p i + z η,i , with z η,i distributed according to the above.Second, it turns out that the above smoothing operation only works if the density issufficiently large. Otherwise, the smoothing noise has too much power. • For any ˜ p = (˜ p a ) a ∈ A \ a drawn, estimate the event that ˜ σ η (˜ p ) < γ by fixing some δ small and determining whether menu(s) p with max a ∈ A \{ a } ˜ p a − p a < δ occurswith frequency at least (2 δ ) | A |− γ . Recommend action a for any such p .As δ →
0, the condition holds if the density is at least γ . Together with the previous,we can show that if the receiver instead observes noisy sender actions, the perceivedsender’s strategy is sufficiently well-behaved to maintain the appropriate convergence forthe algorithm. Proposition 5.13.
Suppose the sender is restricted to choosing distributions which are ei-ther discrete or continuous with bounded density. Consider an algorithm which can ensurethat an ε -rational label is PAC-learnable, for any arbitrary ε > given a finite num-ber of possible sender actions. Then there exists a smoothing operation which maintainsPAC-learnability of ε -rationality, for every ε > . The idea of the proposition is to use the smoothing operation to show that the algorithmperceives that the sender uses a σ such that E [ u ( a, θ, p ( a )) : p ] is uniformly equicontinuous.Given that we seek ε -optimality, uniform equicontinuity allows us to essentially discretizethe menu space, transforming the environment into a much simpler one.There are two important properties of the transformation which allows us to ensure thisworks. The first is that, defining ˜ σ η ( · : θ ) to be the perceived p distribution of p i + z i , wehave: D α ˜ σ η ( p : θ ) = Z P D α φ η ( p − ˜ p ) σ (˜ p : θ ) d ˜ p, so that ˜ σ η inherits the smoothness properties of φ η . The second is that, on any compactsubset of P , we have σ η ( · : θ ) → σ ( · : θ ) uniformly. Now, in order to obtain uniformcontinuity as η →
0, it will be important that we can simultaneously ensure that thesender’s strategy does not involve dramatic movements in the conditional probability. Forinstance, suppose the sender were to use the following strategy: σ ( p : G ) = p (sin (cid:18) p (cid:19) + 1) , σ ( p : B ) = p (sin (cid:18) p − π (cid:19) + 1) , defined on an interval [0 , p ] such that both densities integrate to 1. Then P [ θ = G : p ] = 1if p = k +1 / π for some k ∈ N , and 0 if p = k +1 / π , for some k ∈ N . As k → ∞ (sothat p → p realizations which only occur with low probability according to an estimated density. Seeking to estimate the probability that all sender actions are within δ of p in order toestimate the density is just one way of doing this step; for instance, one could estimatethe CDF ˜ σ η ( p ), and use the estimated density to determine whether the observationsshould be thrown away. Ultimately, however, given the compact P , we can minimize the One may wonder why this trick works; for instance, we do not obtain the result when σ ( p : G ) =sin (cid:16) p (cid:17) +1 , σ ( p : B ) = sin (cid:16) p − π (cid:17) +1 . However, unlike the previous example, these will fail the continuityrequirement on the sender’s strategy space, which is needed in the proof.
ACHINE LEARNING 27 probability that this is done by using sufficiently low thresholds. As a result, it has avanishing impact on PAC-learnability, as well as the sender’s expected profit.6.
Review of Examples
We now verify the implications of this observation on the sender behavior in our par-ticular examples, showing that this results in the sender-preferred Stackleberg outcome isemerging. This requires us to verify the previously discussed conditions in the context ofthese applications.6.1.
Informed Principal.
The decision problem of the agent is to identify each pair ( x, q )of payment x and cost q as an acceptable constract ( a = 1) or not ( a = − H ( λ x , λ q , ω ) = { ( x, q ) : λ x x + λ q q = ω } and h ( x, q ) = ( x, q ) ∈ H + ( λ x , λ q , ω ) − τ ˆ A by estimating E v ( θ, p, a ) for each ( p, a ). Lemma 6.1.
Suppose that σ assigns a positive probability to ( x, q ) where E ( q − θx : ( q, x )) = 0 for x > . Then σ is not a best response to τ ˆ A .Proof. Let ( x, q ) be some offer such that the agent is indifferent between accepting andrejecting, so that: q − E [ θ : ( x, q )] x = 0The principal’s expected payoff is found by taking the expectation of u ( θ, ( x, q ) , a ) overall realizations of x, q . By the law of iterated expectations, this occurs if and only if theprincipal’s payoff is maximized following each realization of ( x, q ). We claim the principalis not indifferent between actions following any such ( x, q ). Indeed, letting E [ θ : ( x, q )] = r ,indifference implies:(1 − r ) f ( I − rx ) + rf ( I − L + (1 − r ) x ) = (1 − r ) f ( I ) + rf ( I − L ) . Note that equality holds if f ( y ) = y . This implies that both lotteries, whether or notthe principal accepts, have the same expected values. However, if f is concave, then since I > I − rx > I − L + (1 − r ) x > I − L , it must be that the left hand side is strictly greaterthan the right hand side.It follows that if indifference holds, the principal strictly prefers the agent accept theoffer by slightly reducing x . (cid:3) Following the same logic as in the previous example, we conclude that if σ is a bestresponse to τ ˆ A , then τ ˆ A ( D t )( q, x ) = y R ( σ, ( x, q ))with probability 1. A best reply σ to τ ˆ A emulates ( σ R , y R ( σ R , p )).6.2. Labor Market Signaling.
The firm’s objective function is to forecast the produc-tivity of the worker: v ( θ, p, a ) = − ( θ − a ) If A is a real line, then y R ( σ, p ) = arg max a ∈ A E θ (cid:2) v ( θ, p, a ) : p, σ (cid:3) where the posterior distribution over θ is calculated via Bayes rule from σ and the priorover θ . Strict concavity of v implies that y R ( σ, p ) is a strict best response ∀ σ, p .Without loss of generality, we consider a single threshold decision rule parameterizedby ( a + , a − , p ): h ( p ) = ( a + if p ≥ p a − if p < p . Let H be the set of all single threshold decision rules. In each round, h t solvesmax h ∈H E θ (cid:2) v ( θ, p, a ) : p, σ (cid:3) if the data includes σ . We construct τ A accordingly. If σ is not observable by the algorithm,we estimate the posterior distribution of σ conditioned on each p to construct τ ˆ A . If theagent learns y R ( σ, p ) eventually ∀ σ, p , then the principal’s choice σ R E [ u ( θ, p, y R ( σ, p )) : σ, p ] = X θ X p u ( θ, p, y R ( σ, p )) σ ( p : θ ) π ( θ ) . If σ R entails separation by the high productivity worker, then the Riley outcome is thesolution, that generates the largest ex ante expected surplus for the principal among allseparating equilibria. In order to satisfy the incentive constraint among different typesof the principal, the principal with θ = H incurs the signaling cost. If the signaling costoutweighs the benefit of separation, then σ R is the pooling equilibrium where both typesof the workers takes the minimal signal.The analysis is based upon the assumption that y R ( σ, p ) is a strict best response ∀ σ, p .As | A | = J < ∞ , y R ( σ, p ) may not be a strict best response for some σ and p . Let usassume that A = { a = 0 , a , . . . , a J } and a i − a i − = ∆ > a F = 1 + sup Θ >
0. Although y R ( σ, p ) may not be a strictbest response for some σ and p , the set of best responses contains at most 2 elements,which differ by ∆ >
0. Abusing notation, let y R ( σ, p ) be the set of best responses, if theagent has multiple best responses at p . Applying the convergence result, we have ∃ T suchthat, ∀ t ≥ T , P (cid:16) ∃ y ∈ y R ( σ, p ) , y = τ ˆ A ( D t )( p ) (cid:17) < e − ρt . ACHINE LEARNING 29
For a sufficiently small ∆ > σ R is either a strategy close to the Riley outcome, or thepooling equilibrium where both types of the principals choose the smallest value of p .6.2.1. Monopoly Market.
In the model of Rubinstein (1993) illustrated in Section 3.3.3,suppose that type 1 buyer is an algorithmic player, while type 2 buyer is a rational player.To simplify the model, we assume that type 2 buyer’s decision is y R ( σ, p ) ∀ σ . Assumethat type 1 buyer uses τ ˆ A . y R ( σ, p ) is not a strict best response, if E θ v ( θ, p, a, i ) = 0 ∀ a ∈ A = { , − } , ∀ i (6.14)so that the agent is indifferent between accepting and rejecting p . Thus, τ ˆ A is not PAClearnable.Still, the best response of the monopolistic seller against τ ˆ A is σ R . The critical stepis to show that a rational seller would not use any σ which assigns positive probability p > v L satisfying (6.14). Lemma 6.2.
Fix σ which assigns p > v L with positive probability, satisfying E θ v ( θ, p, ≥ . (6.15) Then, the ex ante expected profit of the principal against τ ˆ A from σ is strictly smaller thanfrom σ ′ : U ( σ R , τ ˆ A ) > U ( σ, τ ˆ A ) . Proof.
It suffices to show that if p > v L and E ( v : p ) − p ≥
0, then the expected profitfrom p is strictly less than π L v L . We write the proof in Rubinstein (1993) for the laterreference. For any price p satisfying P ( H : p ) v H + P ( L : p ) v L ≥ p, the revenue cannot exceed P ( H : p ) v H + P ( L : p ) v L but the cost is P ( H : p )(1 − r ) c + P ( H : p ) rc . Thus, the seller’s expected profit is at most P ( L : p ) v L + P ( H : p )((1 − r )( v H − c ) + r ( v H − c ))Because of the lemon’s problem,(1 − r )( v H − c ) + r ( v H − c ) < P ( H : p ) > P ( H : p ) v H + P ( L : p ) v L ≥ p > v L . Integrating over p , we conclude that the ex ante profit is strictly less than π L v L . (cid:3) Lemma 6.2 implies that again τ ˆ A , the principal will not use σ which assigns a positiveprobability to p so that both 1 and -1 are best responses. Thus, if σ is a best response to τ ˆ A , then y R ( σ, p ) is a strict best response ∀ p > v L . We can apply Proposition 5.3. Proposition 6.3.
In the example of Rubinstein (1993), if σ is a best response to τ ˆ A , then ( σ, τ ˆ A ) is a Nash equilibrium of the algorithm game, which emulates ( σ R , y R ( σ R , p )) . Conclusion
This paper has applied the framework of PAC learnability to describe the performanceof algorithms in a strategic setting. We show that as long as some initial set of classifierssatisfy weak learnability, an algorithm can be specified which ensures the receiver takes anoptimal response to the sender’s action. As noted by Rubinstein (1993), this need not bethe case when the receiver’s behavior follows from the optimally chosen single-thresholdclassifier given the sender’s strategy. However, being able to combine classifiers is enoughto overcome this limitation, even if it only remains possible to find the “best” classifierfrom within this limited class.Our general analysis has focused on settings featuring strategic inference—based onthe observed action of the strategic sender, a rational receiver would update beliefs aboutan underlying state (thus influencing the optimal response). This adds a complicatingfeature that the ex-post optimal action is only observed with noise. Yet because thisnoise diminishes with the size of the sample, we are still able to show this presents noadded difficulty (thanks to results from large deviations theory). We briefly mention thatif the amount of label noise were bounded away from zero, then our approach need not besuccessful (a well-known issue with Boosting algorithms). While a technical contribution,it is one that is necessary due to the uncertainty inherent in our applications of interest.We have sought to articulate the following tradeoff in the design of statistical algorithmsto mimic rationality: on the one hand, simply fitting a single-threshold classifier to datawill fall short of rational play and be exploited. On the other hand, it may not be clearwhy this is the end of the story. By adding the ability to fit classifiers repeatedly andcombining them in particular ways, we show how the rational benchmark can be restored.Here, we have taken as a black box the ability to fit these classifiers. But given this, ouralgorithm specifies exactly how to put these fitted classifiers together in order to constructone which can mimic rationality arbitrarily well.We have focused on a simple yet general setting where the comparison to the rationalbenchmark is most transparent. Still, we believe that many concerns highlighted by themachine learning literature regarding the design of algorithms can speak to issues of in-terest to economic theorists. Given how productive the machine learning literature hasbeen in terms of designing algorithms for the purposes of classification, we hope that ourwork will inspire further analysis of how these algorithms behave in strategic settings.
ACHINE LEARNING 31
Appendix A. Weak Learnability Proofs
The proof of 5.6 uses the following Lemma:
Lemma A.1.
Let H be an arbitrary hypothesis class with the property that for every h ∈ H and everypermutation π : A → A , the composition π ◦ h is contained in H . Then this hypothesis class can do at leastas well as a uniform random guesser.Proof. Let Π be the set of all possible permutations on A , noting that | Π | = k !. Fix an arbitrary classifier h ∈ H , and define h π = π ◦ h . Let c j,y be the cost of assigning label y to price p j . Define X π ∈ Π c j,h π ( p j ) = c j . In particular, note that this is invariant to the true label of j . As a result, the random guesser’s expectedpayoff on observation j is is c j /k !. To see this, note that h ( p j ) gives some fixed guess regarding the labelof price p j . Then randomizing over permutations is equivalent to randomizing over labels, as there are anequal number of permutations which flip the label according to h ( p j ) and every other label.We therefore obtain the following matrix equation, for an arbitrary ρ ∈ (0 , ∞ ), where the number ofcolumns is k ! and the number of rows is the number of possible prices. c j,h ( p ) · · · · · · c j,h π ( p ) − c j /k ! − c j /k ! · ρk ! ... ρk ! = Also note that: (1 /ρ, · · · , /ρ ) · ρk ! ... ρk ! = 1So as long as ρ >
0, by the theorem of the alternative, we therefore cannot have that a vector x existswith: c j,h ( p ) − c j /k !... ...... ... c j,h π ( p ) − c j /k ! · x ≥ ρ ... ρ . Let D ( p ) be an arbitrary distribution. Since P p ∈ P D ( p ) = 1, this implies we can find some π such that: X p j ∈ P D ( p j )( c j,h π ( p j ) − c j k ! ) < ρ . Taking ρ → ∞ and rearranging gives: (cid:16) E p ∼ D [ c j,h π ( p j ) ] (cid:17) ≤ E j ∼ D (cid:20) c j k ! (cid:21) Recalling again that the right hand side of this inequality is the payoff of the random guesser, we haveshown that for every possible distribution over prices, we can find some permutation which delivers a costbounded above by the random guesser. This proves the Lemma. (cid:3)
Proof of Proposition 5.6.
Let H be the set of hyperplane classifiers. We prove this by contradiction. Ifthere were no universal lower bound on the error, then we would have, for all ρ , a distribution D ρ and cost c ρj,y (without loss normalized to be on the unit sphere themselves) with the property that: max h ∈H X p ∈ P D ρ ( p ) c ρj,h ( p ) < U ρc , where U ρc is the payoff of the uniform random guesser who is correct with added probability ρ . Taking ρ → D ∗ and cost function c ∗ such that:max h ∈H X p ∈ P D ∗ ( p ) c ∗ j,h ( p ) = U c , where we note by Lemma A.1 that at least this bound can be obtained by permutation the labels ifnecessary. We will arrive at a contradiction by exhibiting a single-hyerplane classifier that achieves astrictly better accuracy, given D ∗ . Note that H contains the set of “trivial” classifiers, which give allmenus the same label. Also note that the only non-trivial case to consider is when there are at least twoprices in the support of D ∗ ; if there were only one price, then simply choosing the prediction correspondingto the label on that price would yield a perfect fit. Since, by assumption, no classifier does better thanrandom guessing, it must be the case in particular that each trivial classifier cannot exceed the random-guess bound. On the other hand, by our previous result, we know there does exist a trivial classifier whichachieves at least this bound, for any D supported on P .Let P = { p , . . . , p k } be the set of prices supporting D ∗ , and let ˜ p ∈ P be a price in that is also anextreme point of the convex hull of P . Without loss of generality, assume that ˜ p is nontrivial, in the sensethat it does not give the same cost to all labels. Note that indeed, this is without loss, since for any suchprice, the choice of classification is irrelevant. Note that ˜ p is not in the convex hull of P \{ ˜ p } . Therefore,by the separating hyperplane theorem, we can find an h ∈ H which (strictly) separates ˜ p from P \{ ˜ p } .Denote such a hyperplane by h ∗ , and note that the set of hyperplane classifiers contains classifiers whichassign any two labels (possibly the same label) to prices depending on which side of h ∗ they lie on.Also note that, again by our previous result, a trivial classifier supported on P \{ ˜ p } can achieve therandom guess guaranatee if p is distributed according to the conditional distribution on this set. In otherwords, our prior lemma implies that there exists y ∗ ∈ A such that: X p j ∈ P \ ˜ p D ∗ ( p j ) P q ∈ P \ ˜ P D ∗ ( q ) c ∗ j,y ∗ = U c ∗ . On the other hand, a classifier which separates p ˜ j from the other prices can fit p ˜ j perfectly. Thus wemust have c ˜ j,y ˜ j < E ˆ y ∼ Unif [ c ˜ j, ˆ y ] . So consider the hyperplane classifier which predicts ˜ y for ˜ p , and y ∗ for p ∈ P \{ ˜ p } , i.e., depending onwhich side of h ∗ they are on (acknowledging that this may be a trivial classifier). Denote the resultingclassifier by h . For this single-hyperplane classifier, we have X p j ∈ P D ∗ ( p j ) c j,h ( p j ) = D ∗ ( p ˜ j ) c ˜ j,y ˜ j + X q ∈ P \{ p ˜ j } D ∗ ( q ) X p k ∈ P \{ p ˜ j } D ∗ ( p k ) P q ∈ P \{ p ˜ j } D ∗ ( q ) c k,y ∗ > U c ∗ , where the inequality holds since the single-threshold classifier does strictly better on some non-trivial price,and as well on all other prices. This completes the proof. (cid:3) Appendix B. Specifying the Algorithm Parameters and the Proof of Proposition 5.10.
B.1.
Convergence of τ A . If all prices are trivial, then we will achieve a contradiction, because that implies that the classifierdoes do at least as well as the edge-over-random guesser, since all classifiers achieve the same payoff.
ACHINE LEARNING 33
B.1.1.
The | A | = 2 case. We replicate the proof in Schapire and Freund (2012) for reference. Define F t ( p ) = t X k =1 α k h k ( p ) . Following the same recursive process described in Schapire and Freund (2012), we have d t +1 ( p ) = d ( p ) exp (cid:16) − y ( σ, p ) P tk =1 α k h k ( p ) (cid:17)Q tk =1 Z k = d ( p ) exp( − y ( σ, p ) F t ( p )) Q tk =1 Z k . (B.16)Following Schapire and Freund (2012), we can show that P (cid:0) H t ( p ) = y ( σ, p ) (cid:1) = E X p d ( p ) ( H t ( p ) = y ( σ, p )) ≤ E X p d ( p ) exp( − y ( σ, p ) F t ( p )) , and P ( H t ( p ) = y ( σ, p )) = E t Y k =1 Z k . Note Z k = X p d k ( p ) exp (cid:0) − y ( σ, p ) α k h k ( p ) (cid:1) . The rest of the proof follows from Schapire and Freund (2012), which we copy here for later reference. Z t = X p d t ( p ) exp (cid:0) − y ( σ, p ) α t h t ( p ) (cid:1) = X y ( σ,p ) h t ( p )=1 d t ( p ) exp ( − α t ) + X y ( σ,p ) h t ( p )= − d t ( p ) exp ( − α t )= e − α t (1 − ǫ t ) + e α t ǫ t = e − α t (cid:18)
12 + γ t (cid:19) + e α t (cid:18) − γ t (cid:19) = q − γ t where γ t = 12 − ǫ t . By weak learnability, we know that γ t is uniformly bounded away from 0: ∃ γ > γ t ≥ γ ∀ t ≥ . Recall that the maximum number of the elements in the support of σ is N . Thus, d t +1 ( p ) = d ( p ) t Y k =1 q − γ t ≤ N (cid:16) − γ (cid:17) t ≤ N e − γ t where the right hand side converges to 0 at the exponential rate uniformly over p .B.2. The | A | > case. The specification of the algorithm can be found in Mukherjee and Schapire (2013).The proof provided below fills in some details to show that convergence holds in a self-contained way.First, initialize F y ( x i ) = 0. • From previous stage, take F ty . • At stage t , find the h ∈ H solving:min h ∈H m m X i =1 [ h t ( x i ) = y i ] ( e − η − X ˜ y = y i e η ( F t − y − F t − yi ) + [ h t ( x i ) = y i ]( e η − e η ( F t − ht ( xi ) − F t − y ( x i )) . • Define F ty ( x i ) = P ts =1 [ h t ( x i ) = y ]. The final prediction is H t ( x i ) = arg max ˜ y P Tt =1 [ h t ( x i ) = ˜ y ] . The weak learnability condition says that the hypothesis class can outperform a random guesser thatdoes better than some γ , where we allow for a potentially asymmetric cost of making different errors.We now show convergence to the rational rule: Step 1: Bounding The Mistakes : This step is as previous. We have m X i =1 [ H t ( x i ) = y i ] ≤ m X i =1 X ˜ y = y i e η ( F t ˜ y ( x i ) − F tyi ( x i )) . Indeed, the exponential is positive, so this inequality holds when y i is labelled correctly, and if the labelis incorrect, then that means that some ˜ y i satisfies F t ˜ y i ( x i ) > F ty i ( x i ). Since all exponential terms arepositive, and furthermore the exponent is positive if x i is labelled incorrectly, meaning the right hand sideis greater than 1 if mislabeled. Step 2: Recursive Formulation of the Loss
We now show that the right hand side goes to 0 at anexponential rate. We define the loss function to be: L t ( x i ) = X ˜ y = y i e η ( F t ˜ y ( x i ) − F tyi ( x i )) , ˜ L t = 1 m m X i =1 L t ( x i ) . We first express ˜ L t +1 as a function of ˜ L t . Note that F t +1˜ y ( x i ) = F t ˜ y ( x i ) for all ˜ y = h t ( x i ), and F t +1˜ y ( x i ) = F t ˜ y ( x i ) + 1 for ˜ y = h t ( x i ). The loss from a given x i changes depending on whether or not it iscorrectly classified. For any observation that is classified correctly at the t + 1th stage, we multiply thatobservation’s loss by a factor of e − η . On the other hand, for any observation that is classified incorrectlyas ˜ y , we add the following: e η ( F t ˜ y ( x i ) − F tyi ( x i )) ( e η − . So:˜ L t +1 = 1 m X i : h t +1 ( x i )= y i e − η L t ( x i ) + X i : h t +1 ( x i ) = y i (cid:18) L t ( x i ) + e η ( F tht +1( xi ) ( x i ) − F tyi ( x i )) ( e η − (cid:19) . Note that if we subtract ˜ L t from both sides, and substitute in for L t ( x i ) above, we obtain:˜ L t +1 − ˜ L t = 1 m X i : h t +1 ( x i )= y i ( e − η − X ˜ y = y i e η ( F t ˜ y ( x i ) − F tyi ( x i )) + X i : h t +1 ( x i ) = y i e η ( F tht +1( xi ) ( x i ) − F tyi ( x i )) ( e η − . Step 3: Weak Learnability
By the above, h t +1 is chosen to solve:min h ∈H m m X i =1 [ h ( x i ) = y i ] ( e − η − X ˜ y = y i e η ( F t ˜ y ( x i ) − F tyi ( x i )) + [ h ( x i ) = y i ]( e η − e η ( F th ( xi ) ( x i ) − F ty ( x i )) . In fact, using the previous step, we see that this can equivalently be expressed as ˜ L t +1 − ˜ L t . On theother hand, someone who is random guessing, but is correct with extra probability γ , will be correct withprobability − γk + γ , and guess an incorrect label ˜ y with probability − γk . Furthermore, the hypothesis classensures a weakly lower error (as measured by this cost) than the random guessing. Hence this expressionis bounded above by:1 m m X i =1 ( 1 − γk + γ )( e − η − L t ( x i ) + 1 − γk X ˜ y = y i ( e η − e η ( F t ˜ y ( x i ) − F y ( x i )) Again substituting in for L t ( x i ) and rearranging, we obtain: ACHINE LEARNING 35 (cid:18) ( 1 − γk + γ )( e − η −
1) + 1 − γk ( e η − (cid:19) ˜ L t . Putting this together, we have this is an upper bound of ˜ L t +1 − ˜ L t , and therefore:˜ L t +1 ≤ (cid:18) ( 1 − γk + γ )( e − η −
1) + 1 − γk ( e η − (cid:19)! ˜ L t . Step 4: Specifying η We are done if we can ensure ˜ L t → t → ∞ , since Step 1 shows that thisimplies that the number of misclassifications approaches 0 as well. To complete the argument, we mustspecify an η which delivers the exponential convergence. However, first note that if η = 0, the coefficienton ˜ L t in the previous inequality is 1, and the derivative with respect to η is − γ at 0, so that this expressionis less than 1, for some η >
0. Setting η = log(1 + γ ), the above coefficient on ˜ L t reduces to: z k ( γ ) z }| { (cid:18) ( 1 − γk + γ )( 11 + γ −
1) + 1 − γk γ (cid:19) . Note that z k ( γ ) is bounded above by ˜ z ( γ ) = e − γ / . Indeed, this expression is decreasing in k , with z k (0) = 1 = ˜ z (0), and z ( γ ) = 1 − γ < e − γ / = ˜ z ( γ ). Since ˜ L = ( k − L t ≤ ( k − e − γt / , as desired.B.3. Convergence of τ ˆ A . Under the assumption that y R ( σ, p ) is a strict best response,lim t →∞ ˆ y t ( p ) = y R ( σ, p )almost surely. Since ˆ y t ( p ) satisfies the uniform LDP, ∀ ǫ > ∃ ρ ( ǫ, σ ) > T ( ǫ, σ ) such that P (cid:16) ∃ t ≥ T ( ǫ, σ ) , ˆ y t ( p ) = y R ( σ, p ) (cid:17) ≤ e − tρ ( ǫ,σ ) . Since the support of σ contains a finite number of p , the empirical the multinomial probability distributionover θ .Let ˆ π t ( θ : p ) be the empirical probability distribution over Θ following t rounds of observations. By thelaw of large numbers, ˆ π t ( θ : p ) → π ( θ : p ) computed via Bayes rule from the prior distribution over θ and σ . Write Θ = ( θ , . . . , θ | Θ | ). Given ǫ = ( ǫ, . . . , ǫ ) ∈ R | Θ | , the rate function of the multinomial distributionis | Θ | X i =1 ǫ log ǫp ( θ )where p ( θ ) is the probability that θ is realized. Since P θ p ( θ ) = 1, | Θ | X i =1 ǫ log ǫp ( θ ) ≥ | Θ | Y i =1 ǫ log ǫ / | Θ | = | Θ | Y i =1 ǫ log ǫ | Θ | > . Note that the right hand side is independent of σ , which is the rate function of the uniform distributionover Θ. Thus, we can choose ρ ( ǫ ) ≤ ρ ( ǫ, σ ) uniformly over σ , which is strictly increasing with respect to ǫ >
0. We choose T ( ǫ ) independently of σ as well.Define an event L = n ˆ y t ( p ) = y R ( σ, p ) ∀ t ≥ T ( ǫ ) o We know that P ( L ) ≥ − e − tρ ( ǫ ) . Fix t > T ( ǫ ). We have P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) (cid:17) = P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) : L (cid:17) P ( L ) + P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) : L c (cid:17) P ( L c ) ≤ P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) : L (cid:17) + P ( L c ) ≤ P (cid:16) τ ˆ A ( D t )( p ) = y R ( σ, p ) : L (cid:17) + e − tρ ( ǫ ) . Following the same logic as in the proof of Proposition 5.10, we can show that ∃ γ ( G ) > Z t ≤ − γ ( G ) ∀ t ≥ τ ˆ A .Recall that F a ( p ) = t X s =1 α s ( h s ( p ) = a ) . Similarly, we define ˆ F a ( p ) = t X s =1 ˆ α s ( h s ( p ) = a ) . Following the same logic as in the proof of Proposition 5.10, we know that if τ ˆ A ( D t )( p ) = y R ( p ),ˆ F y R ( σ,p ) ( p ) + X a = y R ( σ,p ) ˆ F a ( p ) > . Thus, ( τ ˆ A ( D t )( p ) = y R ( σ, p )) ≤ ˆ F y R ( σ,p ) ( p ) + X a = y R ( σ,p ) ˆ F a ( p ) ≤ exp ˆ F y R ( σ,p ) ( p ) + X a = y R ( σ,p ) ˆ F a ( p ) . Conditioned on event L , ˆ y t ( p ) = y R ( σ, p ) ∀ t ≥ T ( ǫ ) . We can write for t ≥ T ( ǫ ), d t +1 ( p ) = ˆ d t ( p ) exp( α t ( ( h t ( p ) = ˆ y t ( p )) − ( h t ( p ) = ˆ y t ( p ))))ˆ Z t = ˆ d t ( p ) exp( α t ( ( h t ( p ) = y R ( σ, p )) − ( h t ( p ) = y R ( σ, p ))))ˆ Z t = d T ( ǫ ) ( p ) exp( P ts = T ( ǫ ) α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p )))) Q ts = T ( ǫ ) ˆ Z t . ACHINE LEARNING 37
Thus, t Y s = T ( ǫ ) ˆ Z t = X p d T ( ǫ ) ( p ) exp t X s = T ( ǫ ) α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p ))) ≥ (cid:18) min p ∈P ( σ ) d T ( ǫ ) ( p ) (cid:19) X p exp t X s = T ( ǫ ) α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p ))) . Since d ( p ) is the uniform distribution over P ( σ ),min p ∈P ( σ ) d T ( ǫ ) ( p ) > . We can write t Y s =1 ˆ Z t = t Y s = T ( ǫ ) ˆ Z t T ( ǫ ) − Y s =1 ˆ Z t ≥ (cid:18) min p ∈P ( σ ) d T ( ǫ ) ( p ) (cid:19) X p exp( t X s = T ( ǫ ) ˆ α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p )))) T ( ǫ ) − Y s =1 ˆ Z t = (cid:0) min p ∈P ( σ ) d T ( ǫ ) ( p ) (cid:1) Q T ( ǫ ) − s =1 ˆ Z t P p exp hP T ( ǫ ) − s =1 ˆ α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p ))) i × X p exp t X s =1 ˆ α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p ))) over L . Define M ( ǫ ) = (cid:0) min p ∈P ( σ ) d T ( ǫ ) ( p ) (cid:1) Q T ( ǫ ) − s =1 ˆ Z t P p exp( P T ( ǫ ) − s =1 ˆ α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p ))))which is bounded away from 0.Recall that P ( τ ˆ A ( D t )( p ) = y R ( σ, p )) ≤ X p d ( p ) exp( t X s =1 ˆ α s ( ( h s ( p ) = y R ( σ, p )) − ( h s ( p ) = y R ( σ, p )))) ≤ Q ts =1 ˆ Z t M ( ǫ ) ≤ (1 − γ ( G )) t M ( ǫ ) ≤ e − tγ ( G ) M ( ǫ ) . Combining the probabilities over L and L c , we have that ∀ ǫ , ∀ σ ∈ Σ G ⊂ Σ, ∃ T ( ǫ ), ρ ( ǫ ) and γ ( G ) suchthat P (cid:16) ∃ t ≥ T ( ǫ ) , τ ˆ A ( D t )( p ) = y R ( σ, p ) (cid:17) ≤ e − tγ ( G ) M ( ǫ ) + e − tρ ( ǫ ) . We can choose
T > T ( ǫ ) and ρ such that ∀ t ≥ T , e − tγ ( G ) M ( ǫ ) + e − tρ ( ǫ ) ≤ e − ρt which proves the proposition. Appendix C. Proofs for Section 5.3.2
Proof of Proposition 5.11.
Concave differences implies that the set K = { p : u ( θ, p, ≥ u ( θ, p, } is a convex set; if u ( θ, p i , − u ( θ, p i , ≥ i = 1 ,
2, then the same conclusion holds for αp + (1 − α ) p for all α ∈ [0 , p ∗ on the boundary, the supporting hyperplane theorem impliesthat we can find a linear hyperplane ( λ, ω ) tangent to this set at p ∗ .Suppose the algorithm designer prescribes that the receiver choose a = 1 at any menu p such that λ · p ≤ ω and a = 0 otherwise. Note that having the receiver choose a = 1 therefore requires choosing p where the sender would rather the receiver choose action a = 0, by definition of K . Therefore, the strategicplayer cannot do any better than choosing σ ( p | θ ) which is a point mass at p ∗ . (cid:3) Proof of Proposition 5.12.
The ideas in this proof are largely borrowed from Rubinstein (1993), accom-modating two additional features of our enviroment: (a) need to infer the strategy from observed dataand (b) the generalized setting, but we provide the proof for completeness. We construct a strategy σ ∗ for the sender that generates higher payoff than the equilibrium strategy σ R , thus deriving the contradic-tion that σ R is a best response to τ in the long run. More precisely, define ( p ∗ θ , a ( θ )) to be the senderpayoff-maximizing strategy. We show that the sender can induce the receiver to choose a ( θ ) = y R ( p ∗ θ ).Fix ǫ > | Θ | = 2. First suppose v ( θ L , p ∗ L , a ) = v ( θ L , p ∗ L , a ). Let˜ p ∈ ( p L , p H ) satisfies v ( θ L , ˜ p, < v ( θ L , ˜ p, p is multidimensional, we can take ˜ p tp be on the linesegment connecting p ∗ L and p ∗ H ) Set η = v ( θ L , ˜ p, − v ( θ L , ˜ p, > . We then choose ǫ, ǫ H , ǫ L > π ( H ) ǫ H < π ( L ) ǫ L , (C.18)and such that ǫ L ǫ L + η < ǫ < π ( L ) ǫ L − π ( H ) ǫ H π ( L ) ǫ L . (C.19)Under the increasing differences assumption, we can find p i ( ε i ) such that ε i = v ( θ i , p i ( ε i ) , − v ( θ i , p i ( ε i ) , . Consider the following randomized pricing rule σ ∗ of the sender: in state H , ˜ p H ( ǫ H ) is chosen withprobability 1. In state L , p L ( ǫ L ) is chosen with probability 1 − ǫ and ˜ p with probability ǫ .Under this strategy, the optimal response following ˜ p is 0, and this does not vanish as all other parameterstend to 0. However, the ex-post optimal decisions are 1 for both ˜ p L ( ǫ L ) and ˜ p H ( ǫ H ). Nevertheless, (C.19)implies first, the decisionmaker prefers to choose a = 1 if and only if ˜ p L ( ǫ L ) than choose a = 1 if andonly if ˜ p H ( ǫ H ); and second, that the loss from choosing a = 1 following ˜ p is larger than the loss fromchoosing a = 0 at ˜ p L ( ǫ L ). Putting this together, and taking ǫ, ǫ L , ǫ H → v ( θ L , p ∗ L , a ) > v ( θ L , p ∗ L , a ) is even more straightforward, since in this case the gain fromchoosing a is non-vanishing, meaning that we can set ε L = 0.The verification that the optimal rule converges to this threshold when emerging from data is straight-forward; any recursive learning algorithm generates { φ t } which converges to φ ∈ (cid:16) v L − ǫ L , v H + v L (cid:17) toemulate the best response of type 1 buyer against σ . Thus, the long run average payoff against suchalgorithm should be bounded from below by U ∗ p − ǫ . (cid:3) Appendix D. Proofs for Section 5.3.3
D.1.
Proof of Proposition 5.13.
The proof of the theorem proceeds in the following steps: • Step 1: Show that the expected value conditional on price, in the image of the sender’s possiblestrategies after applying the augmentation, is uniformly equicontinuous. • Step 2: Show that the same label is applied to E [ v θ | p + z i,η , σ, φ η ] as would be applied to E [ v θ | p, σ ], with high probability. • Step 3: Verify that the change in recommendation due to discarding “low density prices” occurswith vanishing probability.
ACHINE LEARNING 39
Putting these together shows that the change in the expectation can be made arbitrarily small, as can theprobability that small density observations are drawn. The condition that σ is either discrete or continuousis stronger than necessary; what is necessary is continuity of the conditional expectation as a function ofprice, which can be satisfied if the discrete portions and continuous portions are separated, for instance.However, the proposition highlights that we need not restrict the sender’s strategy space at all in order forour algorithm to converge.The Theorem implies that if the sender were to use an arbitrary strategy σ , the receiver could insteadfocus on finding a rational response to ˜ σ η . Doing so would still lead to PAC learnability of the approximatelyoptimal response to σ . On the other hand, we can show that the optimal response to ˜ σ η is PAC learnable(unlike, potentially, the optimal response to σ ), and doing the change leads to a negligible impact on thesender’s surplus.Before presenting the proof, we argue that uniform equicontinuity implies weak learnability. Supposethat E [ v | σ, p ] − p is uniformly equicontinuous (which holds if E [ v | σ, p ] is uniformly equicontinuous). Byuniform equicontinuity, we have there exists some δ such that whenever (cid:12)(cid:12) p − p ′ (cid:12)(cid:12) < δ , we have that (cid:12)(cid:12)(cid:12) E [ v | σ, p ] − E [ v | σ, p ′ ] (cid:12)(cid:12)(cid:12) < ε, for any σ . Suppose we have some price p such that E [ v | σ, p ] − p > ε . Then if E [ v | σ, p ′ ] − p ′ < − ε , it followsthat (cid:12)(cid:12) p − p ′ (cid:12)(cid:12) > δ . It follows that there can only be at most v H − v L δ prices such that y ( σ, p ) = − y ( σ, p ′ ),where p and p ′ are adjacent (ignoring all prices where (cid:12)(cid:12) E [ v | σ, p ] − p (cid:12)(cid:12) < ε , as the classification decision isirrelevant there).D.1.1. Step One.
We first show that E [ v θ | ˜ σ η , p ] is Lipschitz in p uniformly of ˜ σ η , noting that we arerestricting to prices where ˜ σ η ( p ) > γ . Note that:˜ σ ′ η ( p | θ ) = Z φ ′ η ( p − ˜ p ) σ (˜ p | θ ) d ˜ p ≤ max φ ′ η := φ ′ . Furthermore, we have: ddp P ˜ σ η [ θ | p ] = ˜ σ ′ η ( p | θ ) P [ θ ] P ˜ θ ˜ σ η ( p | ˜ θ ) P [˜ θ ] − ˜ σ η ( p | θ ) P [ θ ]( P ˜ θ σ ′ η ( p | ˜ θ ) P [˜ θ ])( P ˜ θ ˜ σ η ( p | ˜ θ ) P [˜ θ ]) , so: (cid:12)(cid:12)(cid:12)(cid:12) ddp P ˜ σ η [ θ | p ] (cid:12)(cid:12)(cid:12)(cid:12) ≤ φ ′ P [ θ ] · P ˜ θ ˜ σ η ( p | ˜ θ ) P [˜ θ ] ! + φ ′ ˜ σ η ( p | θ ) P [ θ ]( P ˜ θ ˜ σ η ( p | ˜ θ ) P [˜ θ ]) ! ≤ φ ′ P [ θ ] (cid:18) γ + M ( η ) γ (cid:19) , where M ( η ) is a bound on ˜ σ η ( p | θ ) P [ θ ], which exists since σ and φ η have bounded densities. Hencewe see that for all p = p ∗ , the conditional probability is uniformly bounded in p , and is hence Lipschitzcontinuous. Importantly, the bound only depends on η and γ (and P [ θ ]), and is therefore uniform over allstrategies in the image of the augmentation. Hence we can ensure that Lipschitz continuity is mainted forall prices in the support of ˜ σ η .In fact, recall that the Lipschitz constant is equal to the L ∞ norm of the derivative. Hence Lipschitzcontinuity depends only on γ , M ( η ) and φ ′ η , meaning that the Lipschitz constant holds uniformly overthe image of the distributions emerging under the algorithm. It follows that the image is uniformlyequicontinuous.D.1.2. Step Two.
Note that since E [ v θ | σ, p ] is continuous on S = ∪ θ Supp σ ( · | θ ), E [ v θ | σ, p ] isuniformly continuous on any compact K ⊂ S . Define: K γ = { p : X θ σ ( p | θ ) P [ θ ] ≥ γ } . Using that mollifiers converge uniformly on compact sets, we have that ˜ σ η → σ uniformly on K γ . Wetherefore have that, for any ˜ ε , we can find some η such that if η < η and p ∈ K γ , then (cid:12)(cid:12) ˜ σ η ( p | θ ) − σ ( p | θ ) (cid:12)(cid:12) < ˜ ε for all θ , and (cid:12)(cid:12)P θ ˜ σ η ( p | θ ) P [ θ ] − P θ σ ( p | θ ) P [ θ ] (cid:12)(cid:12) < ˜ ε .Furthermore, since σ is uniformly continuous on K γ , we have: (cid:12)(cid:12)(cid:12) σ ( p | θ ) − ˜ σ ( p ′ | θ ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)Z φ η ( p ′ − ˜ p )( σ ( p | θ ) − σ (˜ p | θ )) d ˜ p (cid:12)(cid:12)(cid:12)(cid:12) ≤ ˜ ε, using the uniform continuity of σ on K γ .So for any p ∈ K γ , and η sufficiently small, we have (letting v = max θ v θ ): (cid:12)(cid:12)(cid:12) E [ v θ | σ, p ] − E [ v θ | ˜ σ η , p ′ ] (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P θ v θ σ ( p | θ ) P [ θ ] P ˜ θ ˜ σ η ( p ′ | ˜ θ ) P [˜ θ ] − P θ v θ ˜ σ η ( p ′ | θ ) P [ θ ] P ˜ θ σ ( p | ˜ θ ) P [˜ θ ] (cid:0)P θ σ ( p | θ ) P [ θ ] (cid:1) (cid:0)P θ ˜ σ η ( p ′ | θ ) P [ θ ] (cid:1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ σ ( p ) · ( γ − ˜ ε ) (cid:12)(cid:12)(cid:12)(cid:12)X θ v θ ( σ ( p | θ ) − ˜ σ η ( p ′ | θ )) P [ θ ] X ˜ θ σ ( p | ˜ θ ) P [˜ θ ]+ X θ v θ σ ( p | θ ) P [ θ ] X ˜ θ (˜ σ η ( p ′ | ˜ θ ) − σ ( p | ˜ θ )) P [˜ θ ] (cid:12)(cid:12)(cid:12)(cid:12) ≤ σ ( p ) · ( γ − ˜ ε ) (cid:18) ≤ v ˜ εσ ( p ) z }| {(cid:12)(cid:12)(cid:12)(cid:12)X θ v θ ( σ ( p | θ ) − ˜ σ η ( p ′ | θ )) X ˜ θ σ ( p | ˜ θ ) P [˜ θ ] (cid:12)(cid:12)(cid:12)(cid:12) + ≤ v · ˜ ε · σ ( p ) z }| {(cid:12)(cid:12)(cid:12)(cid:12)X θ v θ σ ( p | θ ) P [ θ ] X ˜ θ (˜ σ η ( p ′ | ˜ θ ) − σ ( p | ˜ θ )) P [˜ θ ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) ≤ v ˜ εγ − ˜ ε . The first inequality follows from adding and subtracting P θ v θ σ ( p | θ ) P [ θ ] P ˜ θ σ ( p | ˜ θ ) P [˜ θ ] to the numeratorinside the absolute value (as well as the lower bound on ˜ σ η ( p )), and the second inequality is from the triangleinequality, and the overbraced expression follows from v θ ≤ v and uniform convergence of ˜ σ η to σ .So for any fixed γ , we can find some some η such that whenever η < η , we can ensure that on K γ , (cid:12)(cid:12) E [ v θ | ˜ σ η , p ] − E [ v θ | σ, p ] (cid:12)(cid:12) < ε ∗ , by choosing ˜ ε sufficiently small so that εγ ( γ − ˜ ε ) < ε ∗ . It follows that ifthe receiver’s classifier converges to a rule that is ε -optimal under ˜ σ η , it converges to a rule that is ε + ε ∗ optimal under σ . The probability that this fails to occur is simply the probability that the price is outsideof K γ , which can be made arbitrarily small by taking γ →
0, since we can approximate the support of σ arbitrarily well.D.1.3. Step Three.
Note that, for an arbitrary continuous distribution f , if p ∼ f we have (for any compact K ): P f [ L γ ] = Z K [ p : f ( p ) ≤ γ ] f ( p ) dp ≤ Z K [ p : f ( p ) ≤ γ ] γdp ≤ µ ( K ) · γ, where µ is Lebesgue measure. It follows that the probability that p ∈ L γ , is small if γ is small, andfurthermore that this probability can be made small uniformly, using only γ .As shown by the claim above, by taking η small, we can ensure that the difference in the conditionalexpected value is small with high probability. By taking γ small, we ensure that the probability of adifferent outcome due to smoothing goes to 0, implying the result. Appendix E. Proofs for Examples
Appendix E. Proofs for Examples

Proof of Lemma 6.2.
It suffices to show that if $p > v_L$ and $E(v \mid p) - p \geq 0$, then the expected profit from $p$ is strictly less than $\pi_L v_L$. We reproduce the proof in Rubinstein (1993) for later reference. For any price $p$ satisfying
$$P(H \mid p) v_H + P(L \mid p) v_L \geq p,$$
the revenue cannot exceed $P(H \mid p) v_H + P(L \mid p) v_L$, but the cost is $P(H \mid p)(1 - r) c + P(H \mid p) r \bar c$. Thus, the sender's expected payoff is at most
$$P(L \mid p) v_L + P(H \mid p)\big((1 - r)(v_H - c) + r(v_H - \bar c)\big).$$
Because of the lemons problem, $(1 - r)(v_H - c) + r(v_H - \bar c) < 0$, and $P(H \mid p) > 0$ whenever $P(H \mid p) v_H + P(L \mid p) v_L \geq p > v_L$. Integrating over $p$, we conclude that the ex ante profit is strictly less than $\pi_L v_L$. $\Box$
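For concreteness, here is a worked numerical instance of the payoff bound; all parameter values are hypothetical and chosen only so that the lemons condition $(1 - r)(v_H - c) + r(v_H - \bar c) < 0$ holds.
\begin{verbatim}
# Worked numerical instance of the Lemma 6.2 payoff bound; all parameter
# values are hypothetical and chosen only to satisfy the lemons condition.
v_H, v_L = 1.0, 0.2    # buyer values for high- and low-quality goods
c, c_bar = 0.7, 2.0    # low and high cost realizations for a high-quality sale
r = 0.5                # probability of the high cost realization

margin_H = (1 - r) * (v_H - c) + r * (v_H - c_bar)
assert margin_H < 0    # lemons condition: H sales lose money on average

# At any accepted price p > v_L we have P(H|p) > 0, so the payoff bound
# P(L|p) v_L + P(H|p) margin_H falls strictly below P(L|p) v_L.
for P_H in [0.1, 0.5, 0.9]:
    P_L = 1.0 - P_H
    print(f"P(H|p) = {P_H:.1f}: payoff <= {P_L * v_L + P_H * margin_H:+.3f}"
          f" < P(L|p) v_L = {P_L * v_L:.3f}")
\end{verbatim}
Integrating the bound over the price distribution replaces $P(L \mid p)$ by its average $\pi_L$, which yields the strict bound $\pi_L v_L$ claimed in the lemma.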
References

Akerlof, G. A. (1970): "The Market for 'Lemons': Quality Uncertainty and the Market Mechanism," Quarterly Journal of Economics, 84(3), 488–500.
Al-Najjar, N. I. (2009): “Decision Makers as Statisticians: Diversity, Ambiguity and Learning,”
Econometrica, 77(5), 1371–1401.
Al-Najjar, N. I., and
M. M. Pai (2014): “Coarse decision making and overfitting,”
Journal of Economic Theory, 150, 467–486.
Arora, R., O. Dekel, and
A. Tewari (2012): "Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret," in
Proceedings of the 29th International Conference on Machine Learning, pp. 1747–1754.
Arora, R., M. Dinitz, T. Marinov, and
M. Mohri (2018): “Policy Regret in Repeated Games,” in
Proceedings of the 32nd International Conference on Neural Information Processing Systems.
Blum, A., M. Hajiaghayi, K. Ligett, and
A. Roth (2008): “Regret minimization and the price of totalanarchy,” in
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 373–382.
Braverman, M., J. Mao, J. Schneider, and
M. Weinberg (2018): "Selling to a No-Regret Buyer," in
Proceedings of the ACM Conference on Economics and Computation (ACM EC), pp. 523–538.
Camara, M., J. Hartline, and
A. Johnsen (2020): "Mechanisms for a No-Regret Agent: Beyond the Common Prior," FOCS.
Cherry, J., and
Y. Salant (2019): “Statistical Inference in Games,” Northwestern University.
Dembo, A., and
O. Zeitouni (1998):
Large Deviations Techniques and Applications. Springer-Verlag, New York, 2nd edn.
Deng, Y., J. Schneider, and
B. Sivan (2019): "Strategizing against No-regret Learners," Discussion paper.
Dietterich, T. G. (2000): “Ensemble Methods in Machine Learning,” in
Multiple Classifier Systems, pp. 1–15. Springer, Berlin, Heidelberg.
Eliaz, K., and
R. Spiegler (Forthcoming): “The Model Selection Curse,”
American Economic Review: Insights.
Esponda, I., and
D. Pouzo (2014): "An Equilibrium Framework for Players with Misspecified Models," University of Washington and University of California, Berkeley.
Fudenberg, D., and
K. He (2018): “Learning and Type Compatibility in Signaling Games,”
Econometrica, 86(4), 1215–1255.
Fudenberg, D., and
D. M. Kreps (1995): "Learning in Extensive Form Games I: Self-Confirming Equilibria,"
Games and Economic Behavior, 8(1), 20–55.
Fudenberg, D., and
D. K. Levine (1993): “Steady State Learning and Nash Equilibrium,”
Econometrica, 61(3), 547–573.
Fudenberg, D., and D. K. Levine (2006): "Superstition and Rational Learning," American Economic Review, 96, 630–651.
Gilboa, I., and
D. Samet (1989): "Bounded versus Unbounded Rationality: The Tyranny of the Weak,"
Games and Economic Behavior, 1, 213–221.
Kamenica, E., and
M. Gentzkow (2011): “Bayesian Persuasion,”
American Economic Review, 101(6), 2590–2615.
Liang, A. (2018): "Games of Incomplete Information Played by Statisticians," Discussion paper, University of Pennsylvania.
Maskin, E., and
J. Tirole (1992): "The Principal-Agent Relationship with an Informed Principal, II: Common Values,"
Econometrica, 60(1), 1–42.
Meyn, S. P. (2007):
Control Techniques for Complex Networks. Cambridge University Press.
Mukherjee, I., and
R. E. Schapire (2013): “A Theory of Multiclass Boosting,”
Journal of Machine Learning Research, 14, 437–497.
Nekipelov, D., V. Syrgkanis, and
E. Tardos (2015): “Econometrics for Learning Agents,” in
Proceedings of the Sixteenth ACM Conference on Economics and Computation, pp. 1–18.
Olea, J. L. M., P. Ortoleva, M. M. Pai, and
A. Prat (2019): "Competing Models," Columbia University, Princeton University, and Rice University.
Rambachan, A., J. Kleinberg, S. Mullainathan, and
J. Ludwig (2020): "An Economic Approach to Regulating Algorithms," Discussion paper, Harvard University, Cornell University, and University of Chicago.
Rubinstein, A. (1993): “On Price Recognition and Computational Complexity in a Monopolistic Model,”
Journal of Political Economy, 101(3), 473–484.
Schapire, R. E., and
Y. Freund (2012):
Boosting: Foundations and Algorithms. MIT Press.
Shalev-Shwartz, S., and
S. Ben-David (2014):
Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Spence, A. M. (1973): “Job Market Signaling,”
Quarterly Journal of Economics, 87(3), 355–374.
Spiegler, R. (2016): "Bayesian Networks and Boundedly Rational Expectations,"
The Quarterly Journal of Economics, 131(3), 1243–1290.
Zhao, C., S. Ke, Z. Wang, and
S.-L. Hsieh (2020): “Behavioral Neural Networks,” Discussion paper.
Department of Economics, Emory University, Atlanta, GA 30322 USA
Email address : [email protected] URL : https://sites.google.com/site/inkoocho Department of Economics, University of Southern California, Los Angeles, CA 90089USA
Email address : [email protected] URL ::