Learning Personalized Policy with Strategic Agents∗

Evan Munro†

January 13, 2021
Abstract
There is increasing interest in allocating treatments based on observed individual data: examples include heterogeneous pricing, individualized credit offers, and targeted social programs. Personalized policy introduces incentives for individuals to modify their behavior to obtain a better treatment. We show that standard risk minimization-based estimators are sub-optimal when observed covariates are endogenous to the treatment allocation rule. We propose a dynamic experiment that converges to the optimal treatment allocation function without parametric assumptions on individual strategic behavior, and prove that it has regret that decays at a linear rate. We validate the method in simulations and in a small MTurk experiment.
Keywords:
Design of Experiments, Robustness, Optimization
JEL Codes:
D82, C90, C61

∗ I thank Mohammad Akbarpour, Susan Athey, Anirudha Balasubramanian, Martino Banchio, Alex Frankel, Guido Imbens, Stefan Wager, Bob Wilson, and Kuang Xu for helpful comments and discussions, and the staff at the GSB Behavioral Lab for assistance with running the Amazon Mechanical Turk experiment. Replication code and data for the analysis in the paper is available at http://github.com/evanmunro/personalized-policy.
† Stanford Graduate School of Business. [email protected]

Introduction
Increasingly sophisticated machine learning methods are used to estimate heterogeneous causal effects from complex social and economic individual-level data (Nie and Wager, 2017; Wager and Athey, 2018; Künzel et al., 2019). When treatment effects are heterogeneous, and observed characteristics are informative about individual-level treatment effects, appropriately designed personalized policy can improve outcomes compared to a uniform policy. Examples include using neural networks to construct heterogeneous pricing functions from internet browsing data, allocating credit based on unconventional characteristics like phone usage (Björkegren and Grissen, 2019), and predicting the quality of content on social media platforms using click-through rates. However, if an individual's treatment assignment impacts their utility, they may change their behavior to receive a better treatment. This means that the distribution of observed characteristics depends on how treatments are personalized. For example, in response to heterogeneous pricing of consumer products, individuals may adjust device usage to receive cheaper prices. In response to heterogeneous treatment of media content on social media platforms based on engagement metrics, some publishers may purchase fake clicks and comments, while others may invest more to improve the quality of their content. Compared to when observed covariates are exogenous, manipulation can make characteristics a less reliable basis for allocating treatments, see Ball (2020), while other behavioral changes may have a positive impact on the planner objective. For example, Jin and Vasserman (2019) show that drivers who opt in to a program that makes insurance discounts dependent on driving behavior exhibit safer driving after the monitoring is installed.

In this paper we study the problem of estimating the parameters of a function that assigns a continuous-valued treatment based on observed covariates, when agents may behave strategically in response, in order to receive a better treatment. The optimal personalized policy is defined as the treatment allocation function that maximizes a specified planner objective.

Standard statistical approaches to this problem fall short. Empirical risk minimization approaches, which are standard in industry, assume that observed characteristics are exogenous, and fail to model that the distribution of observed data will shift in response to changes in the treatment allocation function. This paper formalizes a type of Lucas Critique (Lucas et al., 1976) for causal inference. An online seller may be interested in exploiting correlation between browsing history and willingness to pay for a consumer good. However, switching from a uniform pricing policy to one that varies pricing for different individuals based on their browsing history introduces incentives for individuals to mimic the browsing history of an individual with low willingness to pay, and receive a lower price. As a result, a policy that switches from uniform to heterogeneous pricing may not raise as much revenue as expected, unless the impact of strategic responses to price discrimination is taken into account.

Ex ante, it is challenging for an economist to specify what kinds of strategies individuals may employ in different settings. As a result, there is a need for an estimation method that takes into account the endogeneity of observed characteristics to treatment allocation but does not require a parametric model of individual strategic behavior.
We propose estimating the optimal personalized policy using a dynamic experiment that randomizes how treatments depend on observed covariates and takes gradient steps towards the optimum. This procedure converges to the optimal policy in the limit and has regret that decays at a linear rate. Furthermore, it is robust to different models of individual strategic behavior, since we assume only that the planner's objective is concave in the parameters of the treatment assignment function. A key implication for experiment design is that, for this problem, randomizing treatment assignment alone does not provide sufficient information to estimate the optimal policy. Random variation in how treatment assignment depends on observed characteristics is required.

After a brief review of the related literature, we describe formally the problem of a planner who would like to allocate a continuous treatment $w_i$ optimally using some parametric function of observed individual characteristics $x_i$. Individual outcomes $y_i$ are observed post-treatment and may depend on the treatment received. Individuals have some unobserved multidimensional type $\theta_i$ which influences their outcome $y_i$ and how the individual reports $x_i$ in order to receive a better treatment. The goal is to find the treatment allocation function that maximizes the expected value of some observable function of the treatment and outcome for each individual. This objective may be a revenue or engagement metric for a company, or a welfare objective for a government.

We then present in detail two special cases of the general model. The first is the setting of strategic classification, see Hardt et al. (2016), Perdomo et al. (2020), Björkegren et al. (2020), and Frankel and Kartik (2020). In the strategic classification setting, the treatment is the prediction of an outcome $y_i$ using reported characteristics $x_i$. In the absence of incentives to manipulate, the reported characteristics $x_i$ are correlated with the outcome $y_i$. The planner's goal is to minimize the classification error when agents have the incentive to manipulate $x_i$ to receive a higher prediction. This example shows that the methodology introduced solves the version of the problem that has been studied most frequently in the literature.

The second example is that of a profit-maximizing seller of a consumer good who would like to price discriminate based on observed characteristics $x_i$. The treatment is the price shown to each individual and the outcome is how much that individual purchases. This example shows how the method in this paper applies to a variety of more general optimal policy problems with distinct treatments, planner objectives, and agent behavior compared to the strategic classification setting. For example, in the price discrimination setting, the planner objective is revenue maximization, and both the outcome and the observed characteristics are endogenous to the treatment received.

We next describe formally the algorithm that solves for the optimal personalized policy over time. We assume that at each step in time, a different batch of $n$ agents arrives, so agents optimize a static decision problem. At each step, a noisy estimate of the gradient of the objective function is obtained by using small perturbations to randomize how the treatment depends on observed characteristics for each individual, and observing how the objective function is affected by the perturbations.
In the price discrimination example, some customers would receive a price that is slightly more dependent on their browsing history, and others would receive a price that is slightly less dependent on their browsing history. The relationship between these perturbations and the revenue received from each customer leads to a consistent estimate of the gradient of the objective. Over time, the planner uses these gradient estimates to take steps towards the optimal personalized policy. When the objective is strongly concave, the algorithm is guaranteed to converge to the global maximum. We prove that the regret of the proposed algorithm decays at a linear rate. Furthermore, under monotonicity assumptions on the treatment allocation function and the agents' strategic reporting function, we prove that any fixed points of the repeated risk minimization approach, analyzed in Frankel and Kartik (2020) and Perdomo et al. (2020) in the context of the strategic classification problem, are sub-optimal.

After evaluating the theoretical properties of our proposed approach, we return to the two main examples of price discrimination and strategic classification. We show that in simulations, the iterative experimental approach converges to the optimal prediction function and the optimal price discrimination function, and has negligible average regret. In contrast, the risk minimization-based approaches have significantly lower revenue and higher MSE. In the final section of the paper we report the results of a small MTurk experiment. In the experiment, we predict a respondent's self-reported income from their self-reported age, education, and car ownership. We introduce incentives, through a variable bonus payment disclosed on the second page of the survey, for individuals to manipulate their self-reported education and car ownership in order to receive a higher predicted income. We find that a risk minimization approach leads to higher out-of-sample mean squared error compared to a prediction function derived from our proposed iterative algorithm, which downweights characteristics susceptible to manipulation and upweights characteristics that are not as susceptible to manipulation.

Literature Review
The proposed algorithm applies to the problem of strategic classification analyzed theoretically by both computer scientists and economic theorists (Perdomo et al., 2020; Frankel and Kartik, 2020; Ball, 2020), as well as to price discrimination (Varian, 1989) and heterogeneous taxation based on manipulable covariates (Roberts, 1984). A contribution of this paper is introducing a general methodology for estimating personalized policy, so the solution applies to a variety of treatment allocation problems studied in the theory literature, with different planner objectives and models of strategic behavior.

There is existing work in the computer science literature on estimation methods for the strategic classification problem: Hardt et al. (2016) provide a near-optimal classifier based on an assumed cost of gaming, and Dong et al. (2018) provide an estimation method based on zeroth-order optimization in an online setting where a single agent arrives at each step. The most closely related paper is Björkegren et al. (2020), who introduce and test empirically an estimation method that solves for the optimal prediction rule with strategic agents. Björkegren et al. (2020) use a structural model for the estimation, which allows for both estimating the optimal policy and quantifying strategic behavior under various counterfactual policies, but requires making parametric assumptions on the agents' costs and benefits of manipulating their behavior. In contrast, the method in this paper is designed for settings where estimating the optimal personalized policy robustly is more important than counterfactual analysis. As a result, our method converges to the optimal personalized policy without parametric assumptions on the agents' strategic behavior, but is not designed for counterfactual analysis away from the optimum.

The paper is also related to previous work on causal inference without strategic agents. The literature on empirical welfare maximization (EWM) studies the problem of estimating the optimal treatment assignment function based on covariates, in both experimental and observational settings, see Manski (2004), Kitagawa and Tetenov (2018), Kallus and Zhou (2020), and Athey and Wager (2020), among many others. This paper provides an estimation method for the treatment allocation problem with low regret under a non-standard assumption, which is that pre-treatment covariates are endogenous to how treatments are allocated.

Among econometricians, there has been recent interest in applying adaptive experiments to select treatments, usually in settings with discrete-valued treatments, see Kasy and Sautmann (2019), Hadad et al. (2019), and the large theoretical literature on multi-armed bandits. This paper introduces a type of adaptive experiment for allocating continuous treatments based on observed covariates.

The proposed algorithm is related to the literature on derivative-free optimization, where a convex function is optimized by observing sequential function evaluations, see Spall (2005) for a basic overview, and Duchi et al. (2015) for analysis of a method based on multiple function evaluations. We use a similar perturbation technique as Wager and Xu (2019), who use an experimental approach to optimize a fixed price when there are equilibrium effects in a one-sided market.

More broadly, this work is related to the sufficient statistics approach of Chetty (2009), who shows that certain welfare analyses can be performed with estimates of a few key derivatives, and without full specification of a structural model.
It is also related to the literature on robust mechanism design, where optimal mechanisms are derived when there is some uncertainty over the form of agent behavior, see Bergemann and Morris (2013).

The framework considered is as follows. Individuals have three sets of characteristics:

• Pre-treatment characteristics $x_i \in \mathbb{R}^{d_x}$, which are observed before the treatment is allocated.
• Outcomes $y_i \in \mathbb{R}^{d_y}$, which are observed after the treatment is allocated.
• An unobserved i.i.d. type $\theta_i \in \Theta \sim G$, where $G$ is unknown to the planner.

The treatment $w_i = w(x_i; \beta_i)$ is the output of a known continuous function $w$, parameterized by $\beta_i \in \mathbb{R}^k$, that depends on observed pre-treatment characteristics:

$$w(x; \beta_i): \mathbb{R}^{d_x} \times \mathbb{R}^k \to \mathbb{R}$$

$\beta_i$ may be stochastic, which allows the planner to run experiments that vary how the treatment is allocated to different individuals. Each individual's outcome depends on their treatment $w_i$ and their type $\theta_i$:

$$y_i = y(w_i, \theta_i)$$

In some settings the continuous function $y$ is known, while in other settings it must be estimated. This function defines the potential outcomes for each individual, see Imbens and Rubin (2015), as $y_i(w_i) = y(w_i, \theta_i)$. The key difference from a standard causal inference setting, however, is that individuals strategically report $x_i$: the report $x_i$ is the result of the maximization of a utility function $U$, which is unknown to the planner:

$$x_i = x(\beta_i, \theta_i) = \arg\max_x U(w(x; \beta_i), x, y_i, \theta_i)$$

Since $U$ is unknown to the planner, the function $x$ is unknown as well. Although we allow the planner to randomize $\beta$ across individuals in the learning process, the goal of the planner is to choose the fixed $\beta$ that optimizes an objective. The objective is constrained to take the form of the expected value of a known function $\pi(w_i, y_i)$ of each individual's treatment and outcome. Most typical objectives in this setting are of this form. The objective may be as simple as maximizing the expected outcome $E[y_i(w_i)]$. In other cases, such as if the planner is a platform designer, the objective may be slightly more complex, such as revenue maximization in the price discrimination setting, or prediction error minimization in the strategic classification setting. The objective function $\Pi(\beta)$ is defined as follows:

$$\Pi(\beta) = E_{\theta_i \sim G}\Big[\pi\Big(w\big(x(\beta, \theta_i); \beta\big),\; y\big(w(x(\beta, \theta_i); \beta), \theta_i\big)\Big)\Big] \quad (1)$$

The effect of $\beta$ on the objective function is complex. First, there is the direct effect on the treatment allocations $w_i$ and the outcomes $y_i$ through the treatment allocation. There is also an indirect effect, however, since the distribution of observed characteristics $x_i$ depends on $\beta$. With a single sample of data $(w_i, y_i, x_i)_{i=1,\dots,n}$, the planner can evaluate the empirical version of the true objective function. We define this for a sample of $n$ individuals with treatments $w_i = w(x_i; \beta)$ as:

$$\Pi_n(\beta) = \frac{1}{n}\sum_{i=1}^n \pi(w_i, y_i)$$

with $\lim_{n \to \infty} \Pi_n(\beta) = \Pi(\beta)$. The planner would like to set $\beta = \beta^*$, the parameters of the treatment assignment function $w$ that optimize the planner's objective:

$$\beta^* = \arg\max_\beta \Pi(\beta)$$

In the next subsection, we specify assumptions under which $\beta^*$ is unique and introduce our algorithm for estimating $\beta^*$. First, we show how this framework applies to two concrete examples: the problem of strategic classification and the problem of optimal price discrimination based on observed characteristics.
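To fix ideas, the following is a minimal Python sketch of the planner's view of this environment; all function names here are our own illustration rather than part of the paper. The planner chooses `w` and `pi`, while `report_x` and `outcome_y` stand in for the two unknown primitives, the agents' strategic reporting function and the potential-outcome function.

```python
import numpy as np

def empirical_objective(beta, thetas, w, pi, report_x, outcome_y):
    """Pi_n(beta): the sample analogue of the planner objective in Equation (1).

    w(x, beta) and pi(w_i, y_i) are chosen by the planner; report_x(beta, theta)
    and outcome_y(w_i, theta) are the agents' (unknown) strategic reporting and
    potential-outcome functions."""
    xs = np.array([report_x(beta, th) for th in thetas])    # endogenous reports
    ws = np.array([w(x_i, beta) for x_i in xs])             # personalized treatments
    ys = np.array([outcome_y(w_i, th) for w_i, th in zip(ws, thetas)])
    return float(np.mean([pi(w_i, y_i) for w_i, y_i in zip(ws, ys)]))
```

The key point the sketch makes explicit is that `xs` is recomputed whenever `beta` changes: the reported characteristics are best responses to the announced rule, not fixed data.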
Example 1. Strategic Classification

The first example is a social media operator who would like to predict content quality ($y_i$) from click-through rates, which are the observed $x_i$. A higher predicted content quality rewards sellers through better placement in a user's feed, for example. As a result, sellers may purchase fake clicks. The goal is to come up with an optimal prediction function when sellers manipulate their engagement metrics to receive a better quality prediction. We assume that the planner uses a linear prediction function of $y_i$, so $w(x_i; \beta) = \beta_0 + \beta_1 x_i$. A content producer's unobserved type is $\theta_i = [z_i, \gamma_i]$, where $\gamma_i$ is their manipulation ability and $z_i$ is the engagement metric of the post in the absence of manipulation. In this case, as is common in the strategic manipulation literature, we assume that the treatment $w_i$ does not affect the true quality $y_i$, which we can assume is observed sometime after the prediction about the content's quality is made.

The planner's objective is to minimize the squared error from predicting $y_i$ using $x_i$:

$$\beta^* = \arg\min_\beta \; E_{\theta_i \sim G}\Big[\big(y_i - w(x(\beta, \theta_i); \beta)\big)^2\Big]$$

If $x_i$ is assumed to be exogenous, then a sub-optimal $\beta$ will be chosen, as shown in Frankel and Kartik (2020) under a parametric assumption on $x_i(\beta)$. The method described in this paper can be applied to find the optimal prediction function $\beta^*$ in this setting without making any parametric assumptions on $x_i(\beta)$.
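In the notation of the code sketch above, this example corresponds to the following planner-chosen pieces (a hypothetical illustration; the agent-side functions remain unknown to the planner):

```python
# Example 1: the treatment is a linear prediction of content quality,
# and the planner objective is negative squared prediction error.
def w(x, beta):
    return beta[0] + beta[1] * x   # predicted quality from the reported metric

def pi(w_i, y_i):
    return -(y_i - w_i) ** 2       # maximizing pi minimizes MSE
```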
Example 2. Price Discrimination

In this example, the planner is a company that is selling insurance at a fixed rate per dollar of coverage. The company observes a customer search history metric $x_i \in \mathbb{R}$, which is correlated with the customer's value for the insurance policy in the absence of manipulation. The unobserved type $\theta_i = [v_i, \gamma_i, z_i]$ determines the customer's valuation for insurance ($v_i$), their search behavior in the absence of manipulation ($z_i$), and their manipulation ability ($\gamma_i$). The treatment $w_i$ is the price offered to customer $i$. Given a price $w_i$, the individual demands a certain amount of coverage $y(w_i, \theta_i)$. The planner chooses a linear pricing policy $w_i = p_0 + p_1 x_i$, so that $\beta = [p_0, p_1]$. Every customer is charged a fixed price $p_0$ plus a variable price $p_1 x_i$ that depends on their search history. The planner's goal is to set $\beta$ to maximize expected revenue across all customer types, without observing $\theta_i$ or knowing the functional form of $x$ or $y$:

$$\beta^* = [p_0^*, p_1^*] = \arg\max_\beta \; E_{\theta_i \sim G}\big[\big(p_0 + p_1 x(\beta, \theta_i)\big) \cdot y(w_i, \theta_i)\big]$$
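In the same sketch notation, the planner-chosen pieces for this example are (again purely illustrative):

```python
# Example 2: the treatment is a personalized price w = p0 + p1 * x,
# and the planner objective is revenue, price times quantity.
def w(x, beta):
    return beta[0] + beta[1] * x   # price offered given reported search metric

def pi(w_i, y_i):
    return w_i * y_i               # revenue from one customer
```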
We have now introduced a model that describes the environment in which the planner would like to determine the optimal personalized policy. Before exploring methods to optimize in this environment, we first make some basic assumptions on the data generating process, which ensure that the optimal parameterization of the treatment allocation function, defined as $\beta^*$, is unique.

Assumption 1. The functions $w(x_i; \beta)$, $x(\beta_i, \theta_i)$, $y(w_i, \theta_i)$ and $\pi(w_i, y_i)$ are continuously differentiable. $\theta_i$ is a continuous random variable.
Lemma 1. Given Assumption 1, $\Pi(\beta)$ is continuously differentiable.

Assumption 1 implies that $\Pi(\beta)$ is a continuously differentiable function, so that the gradient exists for all $\beta \in \mathbb{R}^k$. The gradient is defined as follows:

$$\nabla \Pi(\beta) = E\left[\left(\frac{\partial \pi}{\partial w} + \frac{\partial \pi}{\partial y}\frac{\partial y}{\partial w}\right)\left(\frac{\partial w}{\partial x}\frac{\partial x}{\partial \beta} + \frac{\partial w}{\partial \beta}\right)\right]$$

To ensure that $\Pi(\beta)$ has a unique maximizer, we make a stronger assumption:

Assumption 2. $\Pi(\beta)$ is $\sigma$-strongly concave.

Under Assumption 2, $\beta^*$ is unique, so it is the unique solution to $\nabla \Pi(\beta^*) = 0$. We briefly review two standard approaches for finding the solution to this equation and indicate why they are insufficient in this case.
Mechanism Design Approach Both $\pi$ and $w$ are specified by the planner, so their partial derivatives are known. If appropriate assumptions are made on the functional forms of the functions $y$ and $x$, as well as the distribution of $\theta_i$, then it is possible to calculate the solution to $\nabla \Pi(\beta) = 0$ directly through standard numerical optimization of the non-linear function $\Pi(\beta)$. In the economics literature, it is common to make assumptions on strategic behavior so that $\beta^*$ has an analytical solution, see Frankel and Kartik (2020). While this is useful for gaining economic intuition about how incentives to manipulate data affect the optimal treatment allocation, if any of the assumptions about strategic behavior or the distribution of unobserved types are incorrect, then in practice the estimated $\beta$ will be sub-optimal. Since it is challenging to make assumptions on strategic behavior and unobserved characteristics that hold across multiple environments, we focus our attention on approaches that do not require such restrictions.
Risk Minimization Approach An alternative that requires fewer assumptions and is more common in practice is an empirical risk minimization approach. The empirical gradient for a batch of $n$ agents is:

$$\nabla \Pi_n(\beta) = \frac{1}{n}\sum_{i=1}^n \left[\left(\frac{\partial \pi_i}{\partial w} + \frac{\partial \pi_i}{\partial y}\frac{\partial y_i}{\partial w}\right)\left(\frac{\partial w_i}{\partial x}\frac{\partial x_i}{\partial \beta} + \frac{\partial w_i}{\partial \beta}\right)\right]$$

In a standard risk minimization approach, the observed characteristics $x_i$ are treated as exogenous. Estimating the derivative of the outcome $y$ with respect to the treatment $w$ is standard in the literature on causal inference and is not the focus of this paper. As a result, we assume the planner has a good estimate of the individual treatment effects $\frac{\partial y_i}{\partial w}$, for example through an experiment that randomizes $w_i$. Then, the empirical risk minimization approach would assume $\frac{\partial x_i}{\partial \beta} = 0$, and find the solution, which we will call $\tilde\beta$, to the following equation:

$$g(\tilde\beta) = \frac{1}{n}\sum_{i=1}^n \left[\left(\frac{\partial \pi_i}{\partial w} + \frac{\partial \pi_i}{\partial y}\frac{\partial y_i}{\partial w}\right)\frac{\partial w_i}{\partial \beta}\right] = 0$$

As long as the treatment effect of $w_i$ on the outcomes $y_i$ does not depend on $\beta$, it is possible to find $\tilde\beta$ using a single sample of data. The derivatives of $\pi$ and $w$ are known and depend on observed data, and we assume we have a good estimate of the derivatives of $y$. However, unless $E\left[\frac{\partial w_i}{\partial x}\frac{\partial x(\beta^*, \theta_i)}{\partial \beta}\right] = 0$, which is not in general true, $\beta^* \neq \tilde\beta$: strategic reporting of characteristics affects the optimal policy function, and a new approach is needed to estimate $\beta^*$.
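In the strategic classification example, this naive approach amounts to OLS of outcomes on reported characteristics from a single sample, treating the reports as fixed. A sketch under that assumption (our own illustration, with `xs` and `ys` a single observed sample):

```python
import numpy as np

def naive_erm(xs, ys):
    """Risk minimization treating reported x as exogenous:
    OLS of y on (1, x) gives the prediction rule beta_tilde."""
    X = np.column_stack([np.ones_like(xs), xs])
    beta_tilde, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return beta_tilde
```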
Iterative Experiment Given that the typical methods that might solve for $\beta^*$ using a single sample of data do not meet the requirements of our setting, we instead turn to an alternative approach that optimizes $\beta$ over time. We assume that at each time $t = 1, \dots, T$, a different batch of $n$ agents arrives and is treated by the planner. We do not observe agents repeatedly over time, so their decision problem is static. Then, we can approximate $\beta^*$ by starting from an initial naive estimate $\hat\beta_0$, perturbing $\beta$ in a zero-mean way, observing how the individual objective values $\pi_i = \pi(w_i, y_i)$ correlate with the perturbations of $\beta$, and taking gradient steps towards the optimal policy. This is inspired by the approach in Wager and Xu (2019) that finds the optimal fixed price in a one-sided market with equilibrium effects. The experiment is described formally in Algorithm 1.
Algorithm 1: Online Experiment for Optimizing Personalized Policy

Input: Initial estimate $\hat\beta_0$, sample size $n$, step size $\eta$, perturbation size $\alpha$, and number of steps $T$
Output: Updated estimate $\hat\beta_T$

$t = 1$; $K = \dim(\hat\beta_0)$;
while $t \le T$ do
    A new batch of $n$ agents arrives; agent $i$ has unobserved type $\theta_{ti} \in \Theta$;
    for $i \in \{1, \dots, n\}$ do
        Sample $\epsilon_i$ randomly from $\{-1, 1\}^K$;
        Announce $\beta_i = \hat\beta_{t-1} + \alpha \epsilon_i$;
        Agent reports $x_{ti} = x(\beta_i, \theta_{ti})$;
        Treat the agent with $w_{ti} = w(x_{ti}; \beta_i)$;
        Agent reports outcome $y_{ti} = y(w_{ti}, \theta_{ti})$;
        Calculate the objective value $\pi_{ti} = \pi(w_{ti}, y_{ti})$;
    end
    Let $Q_t = \alpha \epsilon_t$ be the $n \times K$ matrix of perturbations and $\pi_t$ the $n$-length vector of objective values;
    Run OLS of $\pi_t$ on $Q_t$: $\hat\Gamma_t = (Q_t' Q_t)^{-1}(Q_t' \pi_t)$;
    $\hat\beta_t = \hat\beta_{t-1} + \frac{2\eta}{t+1}\hat\Gamma_t$;
    $t \leftarrow t + 1$;
end
return $\hat\beta_T$

This algorithm estimates $\hat\beta_T$ in $T$ steps without the planner specifying any functional form for how individuals report observed characteristics $x(\beta, \theta_i)$, or how the outcomes depend on the treatment $y(w_i, \theta_i)$, which are the two unknown processes in our environment. The proposed experiment has two notable characteristics. First, it is dynamic; without strict assumptions on how individuals manipulate characteristics, it is not possible to optimize Equation 1 using a single sample of data without an infeasibly large experiment size that varies $\beta$ globally rather than locally. Second, in contrast to traditional experimental approaches, the design perturbs $\beta_i$, rather than $w_i$. Randomizing the treatment alone does not allow the planner to observe how the dependence of the treatment allocation on observed characteristics affects the objective, which is required to optimize the objective in this setting.
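A minimal Python sketch of Algorithm 1 follows. It reuses the illustrative interface above: `report_x` and `outcome_y` are black boxes for the agents' unknown behavior, `w` and `pi` are chosen by the planner, and a hypothetical helper `sample_types(n, rng)` stands in for the arrival of a new batch of agents.

```python
import numpy as np

def run_iterative_experiment(beta0, n, eta, alpha, T, report_x, outcome_y, w, pi,
                             sample_types, rng=None):
    """Algorithm 1: perturb beta with Rademacher noise, regress objective
    values on the perturbations to estimate the gradient, and take a step."""
    rng = rng or np.random.default_rng(0)
    beta = np.asarray(beta0, dtype=float)
    K = beta.size
    for t in range(1, T + 1):
        thetas = sample_types(n, rng)                # new batch of n agents
        eps = rng.choice([-1.0, 1.0], size=(n, K))   # zero-mean perturbations
        Q = alpha * eps                              # n x K perturbation matrix
        pis = np.empty(n)
        for i in range(n):
            beta_i = beta + Q[i]                     # announced personalized rule
            x_i = report_x(beta_i, thetas[i])        # strategic report
            w_i = w(x_i, beta_i)                     # treatment
            y_i = outcome_y(w_i, thetas[i])          # realized outcome
            pis[i] = pi(w_i, y_i)                    # objective value
        # OLS of objective values on perturbations: the gradient estimate
        gamma_hat = np.linalg.solve(Q.T @ Q, Q.T @ pis)
        beta = beta + 2 * eta / (t + 1) * gamma_hat  # decaying step size
    return beta
```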
In this section, we analyze the convergence of Algorithm 1 to the optimum and compare its performance to alternative approaches. The first result is that the estimate $\hat\Gamma_t$ from each step of the perturbation experiment converges in probability to the true gradient $\nabla \Pi(\beta)$ as the sample size in each step grows large. This result relies on the perturbation size going to zero as $n \to \infty$ at a sufficiently slow rate.

Theorem 1. Fix some $\hat\beta_t$. If the perturbation size is $\alpha_n = \alpha n^{-b}$ for $0 < b < 0.5$, then $\hat\Gamma_t$ from Algorithm 1 converges to the $k$-dimensional gradient of the objective:

$$\lim_{n \to \infty} P\left(\left|\hat\Gamma_t - \nabla \Pi(\hat\beta_t)\right| > \epsilon\right) = 0$$

for any $\epsilon > 0$.

The proof, in Appendix A, uses a Taylor expansion of the objective, the law of large numbers, and Slutsky's theorem to show that $\hat\Gamma_t$ converges to the centered difference approximation of the total derivative of the objective with respect to the policy parameter.

The average regret of a policy in place for $T$ time periods is the average difference in the objective function between the realized policy path and the policy that maximizes the average objective value over the $T$ time periods. In Example 2, the average regret of Algorithm 1 corresponds to the average revenue loss of a policy that learns through an iterative experiment compared to the revenue of a planner with full information who implements the optimal policy immediately from the first step.
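In symbols, a sketch of this notion of average regret (our own rendering of the verbal definition, in the paper's notation; Theorem 2 below bounds a time-weighted variant of this quantity):

```latex
% Average regret of the policy path \hat\beta_1, \dots, \hat\beta_T
% relative to the best fixed parameter over the same horizon:
\mathrm{Regret}_T \;=\; \max_{\beta \in \mathbb{R}^k} \;
  \frac{1}{T} \sum_{t=1}^{T} \Big( \Pi(\beta) - \Pi(\hat\beta_t) \Big)
```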
Theorem 2. Under Assumption 2, also assume that the norm of the gradient of $\Pi$ is bounded by $M$, so $\|\nabla \Pi(\beta)\| \le M$, and that the step size satisfies $\eta > \sigma^{-1}$. If a planner runs Algorithm 1 for $T$ time periods, then we have that the regret decays at rate $O(1/t)$, so for any $\beta \in \mathbb{R}^k$:

$$\lim_{n \to \infty} P\left[\frac{1}{T^2}\sum_{t=1}^T t\left(\Pi(\beta) - \Pi(\hat\beta_t)\right) \le \frac{\eta M^2}{T}\right] = 1$$

A corollary of this is that the procedure converges to $\beta^*$ in probability at a linear rate as the sample size at each time step grows large.
Corollary 1. Under the conditions of Theorem 2,

$$\lim_{n \to \infty} P\left[\|\beta^* - \hat\beta_T\|^2 \le \frac{4\eta M^2}{\sigma T}\right] = 1$$

The proof of Theorem 2 and Corollary 1 is in Appendix A. Theorem 1 shows that the gradient estimate is consistent. The proof of Theorem 2 then applies results for convergence of gradient descent when a consistent, but not necessarily unbiased, gradient oracle is available.

One caveat is that the strong concavity of the objective function in $\beta$ is not something that can be verified by the planner in advance. If the objective function is not strongly concave, then there may be multiple local maxima in the objective. Algorithm 1 may then converge to a critical point of the objective, which may be a local maximizer or a saddle point rather than the global maximizer of the objective. Although the theoretical results are weaker, in practice gradient descent approaches have been used successfully in the optimization of non-convex objective functions, for example in the training of neural networks.

The combination of these two results indicates that the suggested dynamic experiment successfully recovers $\beta^*$. In contrast, risk minimization-based approaches do not converge to $\beta^*$. We argued informally in Section 2.1 that an empirical risk minimization approach based on a single sample of data results in a suboptimal treatment allocation policy. Rather than performing risk minimization once, it is common to conduct repeated risk minimization, which is described formally in Algorithm 2. To isolate the issues with ignoring strategic reporting of $x_i$, we assume when conducting repeated risk minimization that the planner has some way, for example through randomized assignment of $w_i$ or an ex-ante correct model of $y(w_i, \theta_i)$, of estimating the Individual Treatment Effect (ITE) $\frac{\partial y_i}{\partial w}$. Repeated risk minimization can be modeled as an adversarial game, where the agents choose a distribution of characteristics based on the planner's policy, and then the planner chooses their policy based on the latest set of observed characteristics.
Algorithm 2: Repeated Risk Minimization

Input: Initial estimate $\tilde\beta_0$, sample size $n$, number of steps $T$, a procedure to estimate $\frac{\partial y_i}{\partial w}$
Output: Updated estimate $\tilde\beta_T$

while $t \le T$ do
    A new batch of $n$ agents arrives; agent $i$ has unobserved type $\theta_{ti} \in \Theta$;
    Announce $\tilde\beta_{t-1}$;
    for $i \in \{1, \dots, n\}$ do
        Agent reports $x_{ti} = x(\tilde\beta_{t-1}, \theta_{ti})$;
        Treat the agent with $w_{ti} = w(x_{ti}; \tilde\beta_{t-1})$;
        Agent reports outcome $y_{ti} = y(w_{ti}, \theta_{ti})$;
        Calculate the objective value $\pi_{ti} = \pi(w_{ti}, y_{ti})$;
    end
    Define $w_i(\tilde\beta_t) = w(x_{ti}; \tilde\beta_t)$, $y_i(\tilde\beta_t) = y(w_i(\tilde\beta_t), \theta_i)$ and $\pi_i(\tilde\beta_t) = \pi(w_i(\tilde\beta_t), y_i(\tilde\beta_t))$;
    Given the estimate of $\frac{\partial y_i(\tilde\beta_t)}{\partial w}$, $\tilde\beta_t$ solves:
    $$\frac{1}{n}\sum_{i=1}^n \left[\frac{\partial \pi_i(\tilde\beta_t)}{\partial w} + \frac{\partial \pi_i(\tilde\beta_t)}{\partial y}\frac{\partial y_i(\tilde\beta_t)}{\partial w}\right]\frac{\partial w_i(\tilde\beta_t)}{\partial \beta} = 0$$
end
return $\tilde\beta_T$

There are a variety of existing results on repeated risk minimization from the strategic classification literature. Perdomo et al. (2020) prove that as long as the distribution of observed characteristics does not respond too much to changes in the policy, then repeated risk minimization converges to a fixed point. Furthermore, the difference in the optimality of this fixed point compared to the global optimum can be bounded by this sensitivity of the distribution of the observed characteristics to the policy. In a more specific setting with functional form restrictions on agents' response functions, Frankel and Kartik (2020) prove that any fixed point is strictly suboptimal compared to the globally optimal $\beta$. It is worth examining how these results extend to our more general setting of personalized policy. Without further restrictions on the data generating process, there may not be a unique fixed point, or a fixed point at all. If there is a fixed point of Algorithm 2, it can be defined as follows. Let $w_i(\beta_{FP}) = w(x(\beta_{FP}, \theta_i); \beta_{FP})$, $y_i(\beta_{FP}) = y(w_i(\beta_{FP}), \theta_i)$ and $\pi_i(\beta_{FP}) = \pi(w_i(\beta_{FP}), y_i(\beta_{FP}))$. Then $\beta_{FP}$ is any solution to the following equation:

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n \left[\frac{\partial \pi_i(\beta_{FP})}{\partial w} + \frac{\partial \pi_i(\beta_{FP})}{\partial y}\frac{\partial y_i(\beta_{FP})}{\partial w}\right]\frac{\partial w_i(\beta_{FP})}{\partial \beta} = 0 \quad (2)$$

With some additional assumptions on the treatment allocation function $w$ and the strategic reporting function $x$, we can show that any fixed point of a repeated risk minimization procedure is suboptimal in a personalized policy setting.
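For concreteness, a sketch of this adversarial loop specialized to the strategic classification case, where each risk minimization step is just OLS on the latest batch of strategically chosen reports (our own illustration, with hypothetical helper names; `true_y(theta)` is the realized quality, unaffected by the prediction in that setting):

```python
import numpy as np

def repeated_risk_minimization(beta0, n, T, report_x, true_y, sample_types,
                               rng=None):
    """Algorithm 2 specialized to prediction: re-fit OLS on each new batch,
    treating that batch's reports as if they were exogenous."""
    rng = rng or np.random.default_rng(0)
    beta = np.asarray(beta0, dtype=float)
    for _ in range(T):
        thetas = sample_types(n, rng)
        xs = np.array([report_x(beta, th) for th in thetas])  # best responses to current rule
        ys = np.array([true_y(th) for th in thetas])
        X = np.column_stack([np.ones(n), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)         # risk-minimization step
    return beta
```

The contrast with Algorithm 1 is visible in the code: the planner never perturbs `beta` within a batch, so the data carry no information about how the distribution of reports would change under a different rule.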
Theorem 3. Under the conditions of Theorem 2, also assume that $\beta^* \neq 0$, that $w(x_i; \beta)$ is strictly monotonic in $x_i$ for $\beta \neq 0$, and that for all $\theta_i$ we have that $x(\beta, \theta_i)$ is strictly monotonic in $\beta$ for some region of the distribution of $\theta_i$ that occurs with non-zero probability. Then, we have that $\Pi(\beta_{FP}) < \Pi(\beta^*)$.

The proof, in Appendix A, is a simple proof by contradiction showing that $\beta^*$, which sets the limit of $\nabla \Pi_n(\beta)$ to zero, and $\beta_{FP}$, which sets the limit of Equation 2 to zero, cannot be equal. By the strong concavity of $\Pi$, this implies that $\Pi(\beta_{FP}) < \Pi(\beta^*)$. The monotonicity assumptions are natural in many settings with personalized policy. For example, in the price discrimination setting, a linear pricing rule is monotonic in each of the observed characteristics. In addition, most economic models of manipulation, for example see Frankel and Kartik (2020) and Roberts (1984), imply that an increase in incentives to manipulate a characteristic will result in a monotonic change in the report of that characteristic for those with non-zero manipulation ability.

The interpretation of Theorem 3 is straightforward. In an adversarial game between the planner and strategic agents, the planner does not learn about strategic behavior. Instead, the planner simply reacts to the strategic behavior that occurs, and in some cases this can converge to a fixed point. In contrast, in the iterative experiment of Algorithm 1, the planner perturbs the dependence of the treatment on the observed characteristics, so that the relevant aspects of strategic behavior that impact the objective function are learned over time. We next illustrate the theoretical results of this section in simulations based on Example 1 and Example 2.

The first simulation is of the strategic classification setting, which has received a significant amount of attention in the literature. In the strategic classification literature, the target $y_i$ is exogenous to the treatment, which is the prediction $w_i$. The approach we propose in this paper applies to more general settings where the post-treatment outcome depends on the treatment. This includes a wider variety of both private and public sector policies. As a result, the second simulation is of the price discrimination setting, where the amount purchased $y_i$ depends on the treatment, which is the individual-specific price $w_i$.

In both simulations, we examine the convergence and the regret of the estimated policy over $T$ periods with $n$ agents at each step, for four different approaches:

• Full information benchmark: calculate $\beta^*$ based on full knowledge of the data generating process.
• Iterative learning following Algorithm 1.
• Naive risk minimization: assume $x_i$ is exogenous and estimate $\hat\beta$ based on a sample of data collected where individuals do not have an incentive to manipulate their reports of $x_i$.
• Repeated risk minimization following Algorithm 2.
We impose the following assumptions on the framework in Example 1, where the planner's goal is to find the MSE-minimizing prediction function. The assumptions imposed lead to a similar data generating process as the one analyzed in Frankel and Kartik (2020). First, we assume that the unobserved type $\theta_i = [z_i, \gamma_i, r_i]$ has the following distribution:

$$z_i \sim \mathrm{Normal}(0, \cdot), \quad \gamma_i \sim \mathrm{Uniform}(0, \cdot), \quad r_i \sim \mathrm{Normal}(0, \cdot)$$

The reported characteristics are the result of the maximization of a linear-quadratic utility function that is increasing in the quality prediction $w_i = \beta_0 + x_i \beta_1$ and decreasing in the amount of manipulation of the click-through rate:

$$x_i = \arg\max_{\tilde x_i} \; \tilde x_i \beta_1 - \frac{(\tilde x_i - z_i)^2}{2\gamma_i}$$

Solving this leads to a reporting function of $x(\beta, \theta_i) = z_i + \gamma_i \beta_1$. The outcome, which is the realized content quality, is $y(\theta_i) = z_i + r_i$.
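A sketch of this data generating process, written against the Algorithm 1 implementation above (the variance and range parameters are illustrative placeholders, since the values used in the paper's simulation are not reproduced here):

```python
import numpy as np

# Example 1 DGP: theta_i = (z_i, gamma_i, r_i); illustrative parameters.
def sample_types(n, rng):
    z = rng.normal(0.0, 1.0, n)
    gamma = rng.uniform(0.0, 0.5, n)
    r = rng.normal(0.0, 1.0, n)
    return list(zip(z, gamma, r))

def report_x(beta, theta):
    z, gamma, _ = theta
    return z + gamma * beta[1]   # best response: shade the report upward

def outcome_y(w_i, theta):
    z, _, r = theta
    return z + r                 # realized quality, unaffected by the prediction

def w(x, beta):
    return beta[0] + beta[1] * x

def pi(w_i, y_i):
    return -(y_i - w_i) ** 2     # negative squared prediction error

# Illustrative usage, e.g. 1000 periods of batches of 1000 agents:
# beta_hat = run_iterative_experiment(np.zeros(2), 1000, 0.5, 0.1, 1000,
#                                     report_x, outcome_y, w, pi, sample_types)
```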
Figure 1 plots the estimated $\beta_1$ for 1000 periods in which a batch of 1000 agents arrives in each period. $\beta_1$ indicates how much the prediction of the outcome $y_i$ depends on the reported $x_i$. The purple line is the estimated $\beta_1$ from OLS on a sample of data where individuals are not rewarded based on $\beta_1 x_i$. When agents are strategic, this coefficient is too high. It places too much weight on $x_i$, which is susceptible to manipulation. The green line is the estimated $\beta_1$ when the planner updates $\beta$ at each step $t$ using OLS, and agents best-respond in the next period. The prediction functions reached by this adversarial game are still sub-optimal, with a higher realized mean squared error than the full information solution. The blue line is the full information solution, where the planner has access to the manipulation ability of each agent and can solve for $\beta^*$ in one step using non-linear optimization. The orange line is the experimental estimate of $\beta^*$, which uses mean-zero perturbation of $\beta$ and observations of $x_i$ and $y_i$ to take gradient steps towards the optimum. The experimental approach converges to the MSE-optimal full knowledge solution without relying on any prior assumptions on the structure of manipulation. The risk minimization approaches induce a distribution of characteristics that is tainted by manipulation in a way that makes prediction more difficult. The iterative learning approach induces a distribution that is more amenable to prediction, leading to a lower average MSE over the 1000 periods compared to the risk minimization approaches, despite taking some time to converge to the optimal value, see Table 1.

Method                        Average MSE
Full Information              1.1176
Iterative Learning            1.1180
Repeated Risk Minimization    1.1261
Naive Risk Minimization       1.7448

Table 1: Average MSE of Each Approach

In Example 2, the planner would like to set the optimal linear pricing rule $w_i = w(x_i; \beta)$ to maximize revenue when agents may misreport $x_i$ to receive a better price. To generate data in this setting, we impose assumptions on the distribution of the agents' type $\theta_i = [v_i, z_i, \gamma_i]$ and the form of the agents' utility function, which determines their demand $y_i = y(w_i, \theta_i)$ and their strategic responses $x_i = x(\beta_i, \theta_i)$:

$$z_i \sim \mathrm{Uniform}(10, \cdot), \quad v_i \sim \mathrm{Normal}(5 + z_i, \cdot), \quad \gamma_i \sim \mathrm{Uniform}(0, \cdot)$$

We assume the following utility function, which is a function of an individual's reported characteristic $x_i$, the amount purchased $y_i$, the price $w_i$, and the type $\theta_i$:

$$u(x_i, y_i, w_i, \theta_i) = y_i(v_i - w_i) - \frac{y_i^2}{2} - \frac{(x_i - z_i)^2}{2\gamma_i}$$

This leads to a demand that is linear in price: $y(w_i, \theta_i) = v_i - w_i$. Given the demand function, the optimal report of $x_i$ for the individual is:

$$x(\beta, \theta_i) = \frac{z_i - \gamma_i p_1 (v_i - p_0)}{1 - p_1^2 \gamma_i}$$

An individual will shade their report of $x_i$ by an amount that depends on the degree of price discrimination $p_1$ and their manipulation ability $\gamma_i$. Despite the more complex setting, in Figure 2, the simulation shows a similar pattern to the strategic classification simulation. The chart plots $p_1$, which indicates how much the price shown to a customer depends on $x_i$, over 500 steps. In this case the repeated risk minimization approach does not converge to a fixed point, and instead flips back and forth between the naive risk minimization solution and a solution with close to zero price discrimination. As a result, it is omitted from the chart. The naive risk minimization approach price discriminates too much, not anticipating that the manipulation that occurs in response to this price discrimination leads to sub-optimal revenue. In contrast, Algorithm 1 converges to the revenue-optimal price discrimination function.

The difference in average revenues over the course of the 500 periods of the simulation is significant. In Table 2, we show that the risk minimization approaches have a cost that is 100 times larger than the negligible difference in average revenues between the iterative learning and full information approach.

Method                        Regret
Full Information              0.00
Iterative Learning            -0.25
Repeated Risk Minimization    -24.45
Naive Risk Minimization       -48.21

Table 2: Average Revenue of Each Approach Compared to Full Information Optimum
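The agent-side primitives for this second simulation can be sketched in the same interface, with illustrative parameter values where the paper's are not reproduced:

```python
import numpy as np

# Example 2 DGP: theta_i = (v_i, z_i, gamma_i); illustrative parameters.
def sample_types(n, rng):
    z = rng.uniform(10.0, 20.0, n)
    v = rng.normal(5.0 + z, 1.0)
    gamma = rng.uniform(0.0, 0.5, n)
    return list(zip(v, z, gamma))

def report_x(beta, theta):
    v, z, gamma = theta
    p0, p1 = beta
    return (z - gamma * p1 * (v - p0)) / (1.0 - p1 ** 2 * gamma)  # shaded report

def outcome_y(w_i, theta):
    v, _, _ = theta
    return v - w_i               # linear demand

def w(x, beta):
    return beta[0] + beta[1] * x # personalized price p0 + p1 * x

def pi(w_i, y_i):
    return w_i * y_i             # revenue
```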
We also evaluate the impact of the methodology in a small experiment inspired by the setting in Example 1. We set up a Qualtrics survey on Amazon Mechanical Turk, where on Page 1 the respondent is asked to self-report their income and their age range. From Page 2, it is not possible to return to Page 1. On Page 2, the respondent self-reports their education and whether or not they own a car.

For the baseline survey, there is no incentive to misreport, and respondents are paid a small fee to complete the survey. For the experimental steps, respondents are paid the same fixed fee plus a variable bonus equal to the respondent's rescaled predicted income. The bonus is not mentioned on Page 1, so that there are no incentives to misreport income or age. Respondents are made aware of the bonus's value and that it depends on their predicted income on Page 2, which may impact reports of education and car ownership. The bonus value dynamically updates as individuals choose their responses on Page 2. See Appendix B for an image of the survey Page 1, which has the appearance of a normal demographic survey, and of the incentives to manipulate introduced on survey Page 2. The prediction rule is linear, and the objective is to find the $\beta$ that minimizes the expected mean squared error of the prediction of the respondents' incomes, when the distributions of Educ and Car may depend on $\beta$:

$$\text{Income}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Educ}(\beta)_i + \beta_3 \text{Car}(\beta)_i + \epsilon_i$$

We estimate two sets of coefficients. The first follows a risk minimization approach: $\hat\beta_0$ is the OLS estimate on a sample of 100 individuals whose data is collected from the baseline survey without an incentive to manipulate. $\hat\beta_1 = \hat\beta_0 - \eta \hat\Gamma$ is the estimated set of coefficients from a single step of the iterative learning algorithm. $\hat\Gamma$ is estimated from zero-mean perturbation of $\hat\beta_0$ in a sample of 400 individuals who receive a bonus based on $\hat\beta_0 x_i$. This means some individuals receive a prediction rule that is slightly more dependent on observed characteristics than the OLS rule, and some individuals receive a prediction rule that is slightly less dependent on observed characteristics than the OLS rule.

[Table 3: OLS coefficients $\hat\beta_0$ with standard errors, gradient estimates $\hat\nabla \Pi(\beta)$, and updated coefficients $\hat\beta_1$, with rows for the Intercept, Age, Education, and Car; the Intercept row reads -31.43 (26.2) and -30.96.]

In Table 3, the first column reports the OLS coefficients and standard errors, estimated from the baseline sample of 100 respondents. As expected, all three characteristics are useful in predicting income in the absence of incentives for individuals to misreport characteristics. The second column reports the estimated gradients derived from a single step of Algorithm 1. In the third column, we report the updated prediction function, where we used a characteristic-specific step size. The theoretical results in Frankel and Kartik (2020) and Ball (2020) indicate that under certain assumptions on agent behavior, the prediction-optimal coefficient on characteristics with a high variance in manipulation ability across individuals will have a lower weight compared to the OLS coefficient. Since we provide incentives to manipulate Car and Education but not Age, we might expect that the prediction rule would react to strategic behavior by increasing the coefficient on Age, and decreasing it on Education and Car. In the single gradient step conducted in the experiment, we do see the coefficient on Car decrease and Age increase, although we also see the coefficient on Education, which is a manipulable characteristic, increase.

We then calculate out-of-sample MSE in three different ways ($n = 400$ for each):

1. Risk Minimization: Predict $y_i$ using $w_i = \hat\beta_0 x_i$ for a sample of 400 individuals who report data without incentives to manipulate.
2. Naive Risk Minimization: Predict $y_i$ using $w_i = \hat\beta_0 x_i$ for a sample of 400 individuals who receive a bonus based on $\hat\beta_0 x_i$.
3. Iterative Learning: Predict $y_i$ using $w_i = \hat\beta_1 x_i$ for a sample of 400 individuals who receive a bonus based on $\hat\beta_1 x_i$.

In Figure 3, we show that, as expected, the out-of-sample MSE is lowest under the risk minimization approach, when individuals are not incentivized to misreport their characteristics. The MSE is highest under the naive risk minimization approach, when OLS, which incorrectly assumes reported characteristics are exogenous, is used to estimate the prediction function. The out-of-sample MSE takes an intermediate value under the iterative learning approach, when some of the impact of strategic responses is taken into account when formulating the prediction rule.
When a planner treats individuals in a heterogeneous way based on some observed characteristics about that individual, incentives are introduced for individuals to manipulate their behavior to receive a better treatment. We have shown theoretically, in simulations, and in practice that this impacts how treatments should be optimally allocated based on observed individual-level data. We propose an iterative method that converges to the optimal treatment assignment function, without making parametric assumptions on the structure of individuals' strategic behavior. The key to the success of this method is the dynamic approach, and randomizing how the treatment depends on observed characteristics rather than randomizing the treatment itself. There is a variety of potential future work involving the combination of economic models with advances in stochastic optimization to design policy that adjusts optimally without strict assumptions on the environment.

References

Athey, Susan and Stefan Wager, "Policy Learning with Observational Data," 2020.
Ball, Ian, "Scoring Strategic Agents," Job Market Paper, 2020.

Bergemann, Dirk and Stephen Morris, "Robust predictions in games with incomplete information," Econometrica, 2013, 81 (4), 1251–1308.

Björkegren, Daniel and Darrell Grissen, "Behavior revealed in mobile phone usage predicts credit repayment," 2019.

Björkegren, Daniel, Joshua E. Blumenstock, and Samsun Knight, "Manipulation-Proof Machine Learning," arXiv preprint arXiv:2004.03865, 2020.

Chetty, Raj, "Sufficient statistics for welfare analysis: A bridge between structural and reduced-form methods," Annual Review of Economics, 2009, 1 (1), 451–488.

Dong, Jinshuo, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu, "Strategic classification from revealed preferences," in "Proceedings of the 2018 ACM Conference on Economics and Computation," 2018, pp. 55–70.

Duchi, John C., Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono, "Optimal rates for zero-order convex optimization: The power of two function evaluations," IEEE Transactions on Information Theory, 2015, 61 (5), 2788–2806.

Frankel, Alex and Navin Kartik, "Improving Information from Manipulable Data," arXiv preprint arXiv:1908.10330, 2020.

Hadad, Vitor, David A. Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey, "Confidence intervals for policy evaluation in adaptive experiments," arXiv preprint arXiv:1911.02768, 2019.

Hardt, Moritz, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters, "Strategic classification," in "Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science," 2016, pp. 111–122.

Imbens, Guido W. and Donald B. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge University Press, 2015.

Jin, Yizhou and Shoshana Vasserman, "Buying Data from Consumers," 2019.

Kallus, Nathan and Angela Zhou, "Minimax-Optimal Policy Learning Under Unobserved Confounding," Management Science, 2020.

Kasy, Maximilian and Anja Sautmann, "Adaptive treatment assignment in experiments for policy choice," 2019.

Kitagawa, Toru and Aleksey Tetenov, "Who should be treated? Empirical welfare maximization methods for treatment choice," Econometrica, 2018, 86 (2), 591–616.

Künzel, Sören R., Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu, "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences, 2019, 116 (10), 4156–4165.

Lucas, Robert E. et al., "Econometric policy evaluation: A critique," in "Carnegie-Rochester Conference Series on Public Policy," Vol. 1, 1976, pp. 19–46.

Manski, Charles F., "Statistical treatment rules for heterogeneous populations," Econometrica, 2004, 72 (4), 1221–1246.

Nie, Xinkun and Stefan Wager, "Quasi-oracle estimation of heterogeneous treatment effects," arXiv preprint arXiv:1712.04912, 2017.

Orabona, Francesco, Koby Crammer, and Nicolò Cesa-Bianchi, "A Generalized Online Mirror Descent with Applications to Classification and Regression," 2014.

Perdomo, Juan C., Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt, "Performative prediction," arXiv preprint arXiv:2002.06673, 2020.

Roberts, Kevin, "The theoretical limits to redistribution," The Review of Economic Studies, 1984, 51 (2), 177–195.

Spall, James C., Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Vol. 65, John Wiley & Sons, 2005.

Varian, Hal R., "Price discrimination," Handbook of Industrial Organization, 1989, 1, 597–654.

Wager, Stefan and Kuang Xu, "Experimenting in equilibrium," arXiv preprint arXiv:1903.02124, 2019.

Wager, Stefan and Susan Athey, "Estimation and inference of heterogeneous treatment effects using random forests," Journal of the American Statistical Association, 2018, 113 (523), 1228–1242.

A Proofs
A.1 Proof of Theorem 1
Proof.
We can drop the $t$ subscripts since we are fixing $\hat\beta = \hat\beta_t$. Let $Q \in \{-1, 1\}^{n \times k}$ be the $n \times k$ matrix of un-scaled experimental perturbations, where $Q_{ik} = \epsilon_{tik}$. Let $\pi$ be the $n \times 1$ vector of objective values as a function of each individual's treatment and outcome, where $\pi_i = \pi(w_i, y_i)$. Since $q_{ik}$ is drawn i.i.d. for each $i$ and each $k$, $E[q_{ik} q_{ij}] = 0$ unless $j = k$, in which case $E[q_{ik}^2] = 1$. Since $q_{ik}$ is drawn randomly for each individual, by the Law of Large Numbers,

$$\lim_{n \to \infty} \frac{Q'Q}{n} = E\left[\frac{Q'Q}{n}\right] = I_k$$

$M = Q'\pi$ is a $k \times 1$ vector, where

$$M_j = \sum_{i=1}^n \pi_i q_{ij} = \sum_{i:\, q_{ij} = 1} \pi_i - \sum_{i:\, q_{ij} = -1} \pi_i$$

Let $e_j$ be the length-$k$ basis vector with 1 at position $j$ and 0 everywhere else. Let $r_j$ be a length-$k$ random vector with $r_{jj} = 1$ and the other entries sampled randomly and independently from $\{-1, 1\}$. Then, we have $E_{r_j}[\hat\beta + \alpha r_j] = \hat\beta + \alpha e_j$.

Now, take a Taylor expansion of the loss function around $\hat\beta + \alpha e_j$:

$$E[\pi_i \mid q_{ij} = 1] = E_{r_j}[\Pi(\hat\beta + \alpha r_j)] = \Pi(\hat\beta + \alpha e_j) + \nabla \Pi(\hat\beta + \alpha e_j)'\,\alpha\,(e_j - E[r_j]) + \frac{\alpha^2}{2} E\big[(e_j - r_j)' \nabla^2 \Pi(\hat\beta + \alpha e_j)(e_j - r_j)\big] + \dots$$

The first-order term is zero, since $e_j = E[r_j]$. We can do a similar Taylor expansion of the loss function around $\hat\beta - \alpha e_j$, where the first-order term is also zero:

$$E_{r_j}[\Pi(\hat\beta - \alpha r_j)] = \Pi(\hat\beta - \alpha e_j) + \frac{\alpha^2}{2} E\big[(e_j - r_j)' \nabla^2 \Pi(\hat\beta - \alpha e_j)(e_j - r_j)\big] + \dots$$

When we subtract the Taylor expansions from each other, the even higher-order terms are simply a function of the difference of higher-order derivatives of the objective function evaluated at $\hat\beta + \alpha e_j$ compared to $\hat\beta - \alpha e_j$, multiplied by powers of $\alpha$. Denote the difference of these higher-order terms as $G$. Since the derivatives of $\Pi(\beta)$ are bounded, we can bound the difference in these higher-order terms above by $S\alpha^2$ and below by $-S\alpha^2$, where $S < \infty$. As a result, as $\alpha$ approaches zero, $S\alpha^2/\alpha$ and $-S\alpha^2/\alpha$ go to zero, which means that $G/\alpha$ also goes to zero. Note that we have proved that:
$$\lim_{n \to \infty} \frac{M_j}{n} = \frac{1}{2} E[\pi_i \mid q_{ij} = 1] - \frac{1}{2} E[\pi_i \mid q_{ij} = -1] = \frac{1}{2}\Pi(\hat\beta + \alpha e_j) - \frac{1}{2}\Pi(\hat\beta - \alpha e_j) + G$$

Then, we have that:

$$\hat\Gamma = \left(\alpha^2 Q'Q\right)^{-1}\left(\alpha\, Q'\pi\right), \qquad \lim_{n \to \infty} \hat\Gamma = \lim_{\alpha \to 0}\, (\alpha I_k)^{-1} \frac{E[M]}{n}$$

Since the denominator is the identity matrix, we can examine $\hat\Gamma_j$ separately for each $j \in \{1, \dots, k\}$:

$$\lim_{n \to \infty} \hat\Gamma_j = \lim_{\alpha \to 0} \frac{\Pi(\hat\beta + \alpha e_j) - \Pi(\hat\beta - \alpha e_j)}{2\alpha} + \frac{G}{\alpha}$$

By the preceding discussion, the higher-order term $\frac{G}{\alpha} \to 0$. This leaves the formula for the centered approximation to the derivative of the loss function with respect to $\beta_j$, evaluated at $\hat\beta$, which converges to the true derivative as $\alpha \to 0$. So, the proof is complete.

A.2 Proof of Theorem 2
Proof.
Follows the approach of the proof of Theorem 7 in Wager and Xu (2019). Use Lemma 1 from Orabona et al. (2014) to show that:

$$\sum_{t=1}^T t(\beta - \hat\beta_t)'\hat\Gamma_t \le \frac{1}{2\eta}\sum_{t=1}^T t\|\beta - \hat\beta_t\|^2 + \eta\sum_{t=1}^T \|\hat\Gamma_t\|^2$$

Then, we can replace the gradient estimate $\hat\Gamma_t$ with its limit value $\nabla \Pi(\hat\beta_t)$ and add an appropriate error term:

$$\sum_{t=1}^T t(\beta - \hat\beta_t)'\nabla \Pi(\hat\beta_t) \le \frac{1}{2\eta}\sum_{t=1}^T t\|\beta - \hat\beta_t\|^2 + \eta\sum_{t=1}^T \|\nabla \Pi(\hat\beta_t)\|^2 + \eta\sum_{t=1}^T \left(\|\hat\Gamma_t\|^2 - \|\nabla \Pi(\hat\beta_t)\|^2\right) + \sum_{t=1}^T t(\beta - \hat\beta_t)'\left(\nabla \Pi(\hat\beta_t) - \hat\Gamma_t\right)$$

From Lemma 1, we know that with probability approaching 1 as $n \to \infty$, we have for any $\epsilon > 0$ that:

$$\sum_{t=1}^T t(\beta - \hat\beta_t)'\nabla \Pi(\hat\beta_t) \le \frac{1}{2\eta}\sum_{t=1}^T t\|\beta - \hat\beta_t\|^2 + \eta\sum_{t=1}^T \|\nabla \Pi(\hat\beta_t)\|^2 + \epsilon$$

Then, given that the gradient is bounded by $M$, we have that:

$$\sum_{t=1}^T t(\beta - \hat\beta_t)'\nabla \Pi(\hat\beta_t) \le \frac{1}{2\eta}\sum_{t=1}^T t\|\beta - \hat\beta_t\|^2 + \eta T M^2$$

Next, we use the $\sigma$-strong concavity of $\Pi$, which implies that:

$$\Pi(\beta) \le \Pi(\hat\beta_t) + (\beta - \hat\beta_t)'\nabla \Pi(\hat\beta_t) - \frac{\sigma}{2}\|\beta - \hat\beta_t\|^2$$

as well as the choice of $\sigma > \eta^{-1}$, to replace the left-hand side of the expression, since we have that:

$$\sum_{t=1}^T t\left(\Pi(\beta) - \Pi(\hat\beta_t)\right) + \frac{1}{2\eta} t\|\beta - \hat\beta_t\|^2 \le \sum_{t=1}^T t(\beta - \hat\beta_t)'\nabla \Pi(\hat\beta_t)$$

This then gives the result:

$$\frac{1}{T^2}\sum_{t=1}^T t\left(\Pi(\beta) - \Pi(\hat\beta_t)\right) \le \frac{\eta M^2}{T}$$

with probability approaching 1 as $n \to \infty$.

A.3 Proof of Corollary 1
From Theorem 2, we have that

$$\frac{1}{T^2}\sum_{t=1}^T t\left(\Pi(\beta^*) - \Pi(\hat\beta_t)\right) \le \frac{\eta M^2}{T}$$

with probability approaching 1 as $n \to \infty$. Using the $\sigma$-strong concavity of $\Pi$, and the fact that $\nabla \Pi(\beta^*) = 0$, we can rewrite this as:

$$\frac{\sigma}{2T^2}\sum_{t=1}^T t\,\|\beta^* - \hat\beta_t\|^2 \le \frac{\eta M^2}{T}$$

Then, note that:

$$\frac{T^2}{2}\|\beta^* - \hat\beta_T\|^2 \le \sum_{t=1}^T t\,\|\beta^* - \hat\beta_T\|^2 \le \sum_{t=1}^T t\,\|\beta^* - \hat\beta_t\|^2$$

We can substitute this, which implies the result:

$$\frac{\sigma}{4}\|\beta^* - \hat\beta_T\|^2 \le \frac{\eta M^2}{T}$$

A.4 Proof of Theorem 3
Proof.