Optimal Policy Learning: From Theory to Practice
Giovanni Cerulli
CNR-IRCRES, Research Institute on Sustainable Economic Growth, National Research Council of Italy ([email protected])
November 11, 2020
Abstract
Following in the footsteps of the literature on empirical welfare maximization, this paper contributes by stressing the policymaker's perspective via a practical illustration of an optimal policy assignment problem. More specifically, focusing on the class of threshold-based policies, we first set up the theoretical underpinnings of the policymaker's selection problem, and then offer a practical solution to this problem via an empirical illustration using the popular LaLonde (1986) training program dataset. The paper proposes an implementation protocol for the optimal solution that is straightforward to apply and easy to program with standard statistical software.
Keywords:
Policy learning, Optimal treatment, Program evaluation, Threshold-based assignment rule
JEL Classification:
C53, C61, C63

1 Introduction
Evidence-based program evaluation is increasingly becoming an essential tool for policymakers to fine-tune socio-economic policies (Cerulli, 2015). Generally, decision-makers run multiple rounds of similar policies at the national, regional, or local level. As recurrent policies take place, they can collect massive evidence on past programs' operation and effects, which increasingly enables a learning process based on past experience.

Program evaluation draws on three dimensions: ex-ante, in-itinere, and ex-post. Abstracting from in-itinere evaluation, this article focuses on how decision-makers can improve the ex-ante evaluation of policy effects based on the results obtained from the (antecedent) ex-post evaluations. In tune with the recent literature, I call this process policy learning (Athey and Wager, 2019).

How does this process take place? And, more specifically, what do we mean by ex-ante program evaluation in this specific context? Answering these questions requires a deeper understanding of the causal dynamics underlying a policymaker's action.

To fix ideas, consider a policymaker administering a policy P targeting a specific outcome Y. For ethical or policy-constraint reasons, the policymaker hardly wants to (or can) pick beneficiaries at random. She generally adopts selection strategies aimed at optimizing her objective function, typically by increasing the overall policy effect as much as possible. The question is: which effect?

It is well known that the total effect (call it γ) of a policy intervention can be decomposed into two sub-effects: a direct one (call it α) and an indirect one (call it β). The direct effect is the effect of the policy "as if" the decision-maker had selected both beneficiaries and non-beneficiaries at random. The indirect effect is the part of the total effect due to the selection process operated by the policymaker's choice of beneficiaries (the so-called policy assignment rule).
In this setting, we thus have:

    γ(S) = α + β(S)    (1)

where the indirect effect is clearly a function of the selection process S operated by the policymaker. Different policy assignment rules affect the total effect (or welfare) of the policy, thus raising a quest for an "optimal" policy rule within certain classes of feasible rules (Manski, 2004).

Ex-post program evaluation techniques, either experimental (randomized control trials, RCTs) or quasi-experimental (re-balancing procedures), generally aim at consistently estimating α, treating β as a bias induced by the selection process. This is the econometrician's perspective, which does not, however, coincide with the policymaker's viewpoint. Given budget, ethical, or contextual constraints, the policymaker is in fact mainly interested in increasing the total effect of the policy, and thus in finding the selection rule S* which maximizes this effect (Dehejia, 2005).

Recently, the literature has produced significant theoretical advances on optimal policy assignment in statistics, biostatistics, and econometrics. Following in the footsteps of this literature, this paper contributes by stressing the policymaker's perspective via a practical implementation of an optimal policy assignment problem (or, equivalently, of an "empirical welfare maximization" problem). More specifically, focusing on the class of threshold-based policies, we first set up the theoretical underpinnings of the policymaker's selection problem, and then offer a practical solution to this problem via an empirical illustration using the popular LaLonde (1986) training program dataset. Moreover, the paper proposes an implementation protocol for the optimal solution that can be easily programmed with standard statistical software.

The structure of the paper is as follows. Section 2 presents the theoretical underpinnings of the optimal constrained treatment assignment. Section 3 proposes a protocol for carrying out optimal constrained treatment assignment.
Section 4 illustrates an empirical application using the popular LaLonde (1986) job training dataset and discusses the results. Section 5 concludes the paper.

2 Theoretical underpinnings of optimal treatment assignment

Let X be an individual's vector of characteristics, Y an outcome of interest, and T = {0, 1} a binary treatment. A policy assignment rule G is a function mapping X to T, specifying which individuals are or are not to be treated:

    G : X → T

[Footnote: Specifically, this paper refers to the econometric papers of Manski (2004) and Kitagawa and Tetenov (2018), who extended the former. In his seminal paper, Manski (2004) assumes different probability distributions of the possible experimental outcomes, and considers treatment rules assigning individuals to those treatments producing the best experimental outcomes. These "conditional empirical success" rules consider different subsets of the observed covariate space. The author derives bounds for the welfare regret associated with such rules, showing that it converges to zero at a rate of at least 0.5. Kitagawa and Tetenov (2018) extend Manski's approach to a broader class of rules, including threshold-based ones, and show that similar rates of convergence are achieved by this broader class of rules. Other contributions to the literature are: Dehejia, 2005; Hirano and Porter, 2009; Athey and Wager, 2019; Bhattacharya and Dupas, 2012; Zhao et al., 2012; Zhang et al., 2012.]

[Footnote: With regard to the application presented in this paper, I developed a Stata 16 program for both a univariate and a bivariate empirical welfare optimization threshold rule. All codes are available upon request.]

The conditional average treatment effect is:

    τ(X) = E(Y_1 | X) − E(Y_0 | X)

where Y_1 and Y_0 represent the two potential outcomes of the policy, and E_X[τ(X)] = τ is the average treatment effect. The policy's actual total effect (or welfare) W is defined as:

    W = Σ_{i=1}^{N} T_i · τ(X_i)

and the policy's unconstrained optimal total effect (or unconstrained maximum welfare) as:

    W* = Σ_{i=1}^{N} T*_i · τ(X_i)

where T*_i = 1[τ(X_i) > 0]. The difference between the unconstrained maximum welfare and the actual welfare is called regret, and it is defined as:

    Regret = W* − W

The regret is generally positive, as decision-makers are unable to cherry-pick the best treatment assignment possible, given their constraints and available information. This is particularly true for randomized control trials (RCTs), as treatment is, in this case, deliberately void of any selection objectives.

Table 1 sets out an illustrative example of the previous setting with three treated and three untreated units. The conditional average treatment effect τ(X) is reported in the third column of the table, while the last row of the fourth column shows the actual policy welfare, which in this numerical example is equal to 10. By definition, the optimal (unconstrained) treatment assignment would select only those units obtaining a positive τ(X), which in this specific case are units 1, 3, 4, and 6. This is the optimal selection assignment T*, with the last row of the last column showing that the largest welfare achievable by the policymaker is equal to 26.
The regret is therefore equal to 26 − 10 = 16.

Because of eligibility, budget, ethical, or institutional constraints, policymakers are in general unable to implement the optimal unconstrained policy assignment, and are therefore obliged to rely on a constrained assignment T′ which selects treated units according to (some of) their characteristics. The welfare thus obtained, W′, is at most equal to W* and generally lower. One can then ask whether it is possible for policymakers to produce the largest possible constrained welfare. As the set of feasible treatment assignment rules is huge, answering this question requires restricting the focus to certain classes of policies and finding, among them, the optimal assignment to treatment (the optimal constrained policy assignment).

Although there are several classes of policies, policymakers often use only a few of them, such as "threshold-based", "linear-combination", or "fixed-depth decision tree" rules (Kitagawa and Tetenov, 2018).

Threshold-based assignment policies are popular in policy programs as they are simple to manage and easy to interpret. For this class of policies, policymakers select one or more variables of outstanding importance (such as, for example, a firm's number of employees, a bank's total assets, or an individual's age) and specific values of these variables (the thresholds) discriminating between treated and untreated units. To clarify, suppose we have just one selection variable x, with threshold c. The assignment to treatment is clearly a function of c:

    T_i(c) = T*_i · 1[x_i ≥ c]

with corresponding welfare:

    W(c) = Σ_{i=1}^{N} T_i(c) · τ(X_i)

We define the optimal choice of the threshold c as the one maximizing W(c) over c:

    c* = argmax_c W(c)

If c* exists, the optimal constrained welfare will thus be equal to W(c*).

Extensions to the case of two or three selection variables are straightforward. Suppose we have two selection variables, x and z, and two corresponding thresholds, c_x and c_z.
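To make the univariate threshold search concrete, it can be sketched as a simple grid search over candidate thresholds. The Python sketch below uses made-up data; the paper's own implementation is a Stata 16 program, so function names and values here are purely illustrative:

```python
# Grid search for the optimal univariate threshold c*.
# tau holds estimated conditional effects tau(X_i); x is the
# selection variable. Both are made-up illustrative values.

def welfare(tau, treat):
    """Welfare W = sum_i T_i * tau(X_i)."""
    return sum(t * e for t, e in zip(treat, tau))

def optimal_univariate_threshold(x, tau, grid):
    """Return (c*, W(c*)) maximizing W(c) over a grid of thresholds.
    Rule: treat unit i iff tau(X_i) > 0 and x_i >= c."""
    best_c, best_w = None, float("-inf")
    for c in grid:
        treat = [1 if (e > 0 and xi >= c) else 0 for xi, e in zip(x, tau)]
        w = welfare(tau, treat)
        if w > best_w:
            best_c, best_w = c, w
    return best_c, best_w

tau = [9, -3, 7, 5, -2, 5]      # illustrative tau(X_i) estimates
x = [10, 20, 30, 40, 50, 60]    # selection variable
c_star, w_star = optimal_univariate_threshold(x, tau, x)
# c_star = 10, w_star = 26: the lowest threshold keeps every unit
# with a positive effect, so here the constrained optimum equals W*.
```

Raising the threshold trades welfare for a smaller treated share, which is precisely the tension behind the share constraints and corner solutions discussed in this section.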
In this case, the assignment to treatment is a function of both thresholds and takes on this form:

    T_i(c_x, c_z) = T*_i · 1[x_i ≥ c_x] · 1[z_i ≥ c_z]

The previous welfare, W(c), can thus be maximized over c as a vector of thresholds, one for each selection variable. In two dimensions, the previous assignment rule is called quadrant assignment, as it selects the upper-right quadrant of the four quadrants generated by setting the thresholds.

Some problematic aspects can arise when seeking the optimal thresholds. It is possible that, at the optimal thresholds, the share of units to treat is too small or, conversely, too large. In this case, it would be meaningless to run a policy based on such a small (or large) number of treated units. To solve the problem, the policymaker can, however, either consider budget constraints (for example, the maximum number of units she would be able to treat given the availability of a certain amount of money) or set in advance a targeted number of units to treat (for example, a treatment share between 30% and 40% of the entire reference population). This procedure might lead to a sizable reduction of the welfare but would preserve the full feasibility of the policy.

A related problem arises when, for one or more selection variables, there exists a so-called angle (corner) solution due to a monotonic effect of the selection variable(s) on the welfare. As a practical example, consider the educational attainment of individuals as the selection variable. In many policy contexts, the welfare (or effect) of the policy increases monotonically with the level of education. This would lead to selecting only the people with the highest level of educational attainment in the sample. This can, however, be unfeasible for two reasons: (i) the policy aims at targeting poorly educated people; (ii) the number of treated units would become too small, as people with high educational attainment are generally few.
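The quadrant assignment admits the same grid-search treatment, now over pairs of thresholds. Again, this is a hedged sketch with hypothetical data, not the paper's Stata implementation:

```python
def optimal_quadrant_thresholds(x, z, tau, grid_x, grid_z):
    """Return ((c_x*, c_z*), W) for the quadrant rule
    T_i(c_x, c_z) = 1[tau(X_i) > 0] * 1[x_i >= c_x] * 1[z_i >= c_z]."""
    best, best_w = None, float("-inf")
    for cx in grid_x:
        for cz in grid_z:
            # Only units in the upper-right quadrant with a positive
            # effect contribute to welfare.
            w = sum(e for xi, zi, e in zip(x, z, tau)
                    if e > 0 and xi >= cx and zi >= cz)
            if w > best_w:
                best, best_w = (cx, cz), w
    return best, best_w

# Hypothetical units (selection variables loosely labeled).
tau = [4.0, -1.0, 2.0, 3.0]
x = [25, 40, 35, 50]        # e.g. age
z = [5.0, 2.0, 8.0, 9.0]    # e.g. prior earnings
(cx_star, cz_star), w_best = optimal_quadrant_thresholds(
    x, z, tau, [25, 35, 45], [2.0, 5.0, 8.0])
```

A share-of-treated constraint of the kind discussed above can be imposed by simply skipping grid points whose implied treated share falls outside the target range.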
The solution to this problem is similar to the one set out above: the policymaker can again introduce a threshold limit and/or a prefixed range of treatment shares, and thus run the policy according to these constraints. Again, this procedure may entail a shrinking welfare, but it preserves the possibility of eventually running the policy.

Table 1: Example of an optimal policy assignment rule (three treated and three untreated units; columns report ID, T, τ(X), T · τ(X), T*, and T* · τ(X)). The regret of this policy is equal to 16 = 26 − 10.

Figure 1: Distribution of the conditional treatment effects τ(X). Program: National Supported Work Demonstration (NSWD). Data: LaLonde (1986). Target variable: real earnings in 1978. Estimation technique: regression adjustment (with observable heterogeneity).

3 A protocol for optimal policy assignment

To provide an empirical strategy for carrying out a threshold-based optimal policy assignment decision, we suggest the following procedure consisting of five steps.

Procedure: Threshold-based optimal policy assignment

1. Suppose we have data from an RCT or from an observational study consisting of the information triple (Y, X, T) available for every unit involved in the program.
2. Run a quasi-experimental method with observable heterogeneity, estimate τ(X), and compute the actual total welfare of the policy, W.
3. Identify the optimal unconstrained policy T*, compute W*, i.e. the maximum total welfare achievable by the policy, and estimate the regret as W* − W.
4. Consider a constrained selection rule T(x, c) based on a given set of selection variables, x, and related thresholds, c, and define the maximum constrained welfare as W(x, c).
5. Build a grid of K possible values for c ∈ {c_1, ..., c_K}, and compute the optimal vector of thresholds c_{k*} and the corresponding maximum welfare W(x, c_{k*}).

In the application proposed in the next section, we will show how to implement this procedure on a real policy dataset.

4 Empirical application

We consider the popular experimental dataset from the National Supported Work Demonstration (NSWD) used by LaLonde (1986). As is well known, this study looked at the effectiveness of a job training program (the treatment) administered in 1976 on the real earnings of individuals two years after the completion of the program. It includes a set of demographic, social, and economic variables at the individual level, such as age, race, educational attainment, previous employment condition, and real earnings, as well as the treatment indicator and the real earnings in 1978, which is our target variable.

We first estimate the average treatment effect of this program using a regression-adjustment (RA) approach and a specification of the model including as control variables: real earnings in 1974 and 1975, age, age squared, an indicator for not having a degree, an indicator for being married, one for being black, and one for being Hispanic.

As the assignment to treatment in this program was randomized, the average treatment effect (ATE) is consistently estimated by the treated–control difference in means (DIM), found to be equal to 1.79 thousand dollars.
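The computational core of the five-step procedure (steps 3 to 5, taking the estimated τ(X) from step 2 as given) can be sketched as follows. The values are illustrative, not the NSWD estimates, and the function name is hypothetical:

```python
def protocol_steps_3_to_5(tau_hat, T_actual, x, grid):
    """Steps 3-5: regret of the actual policy, then the optimal
    threshold over a grid for one selection variable x."""
    # Step 3: actual welfare, unconstrained optimum, regret.
    W_actual = sum(t * e for t, e in zip(T_actual, tau_hat))
    W_star = sum(e for e in tau_hat if e > 0)  # T*_i = 1[tau_hat_i > 0]
    regret = W_star - W_actual
    # Steps 4-5: constrained welfare W(x, c) over a grid of K thresholds.
    W_c = {c: sum(e for xi, e in zip(x, tau_hat) if e > 0 and xi >= c)
           for c in grid}
    c_star = max(W_c, key=W_c.get)
    return W_actual, W_star, regret, c_star, W_c[c_star]

tau_hat = [9, -3, 7, 5, -2, 5]   # illustrative tau(X_i) estimates
T_actual = [1, 1, 1, 0, 0, 0]    # who was actually treated
x = [10, 20, 30, 40, 50, 60]     # selection variable
result = protocol_steps_3_to_5(tau_hat, T_actual, x, [20, 40])
# -> (13, 26, 13, 20, 17)
```

Step 2 would come from any quasi-experimental estimator with observable heterogeneity (regression adjustment in the paper's application); here the estimated effects are simply assumed.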
For the sake of comparison, however, we also estimate the ATE by regression adjustment, obtaining a similar estimate, 1.76, still significant at the 1% level. The RA approach also allows us to estimate the average treatment effects conditional on the covariates, τ(X), as well as their distributions, as set out in Figure 1.

We now implement the previous empirical welfare maximization protocol, first in a univariate and then in a bivariate selection setting.

The univariate setting considers only one selection variable. For the sake of comparison, we run the empirical welfare maximization protocol separately on three selection variables: real earnings in 1974, age, and education. For every selection variable, we compute the optimal threshold, namely the selection criterion that would allow the policymaker to maximize the average welfare. Figure 2 illustrates the results.

Figure 2: Computation of the policy optimal selection threshold in univariate cases. Program: National Supported Work Demonstration (NSWD). Data: LaLonde (1986). Target variable: real earnings in 1978. Univariate selection variables: real earnings in 1974 (optimal threshold = 0.96; optimal average welfare = 2.65; 108 of 443 units treated), age (optimal threshold = 42; optimal average welfare = 2.85; 16 of 443 units treated), and educational attainment (optimal threshold = 16; optimal average welfare = 4.24; 0 of 443 units treated).

The graphs in Figure 2 show that, for real earnings in 1974, the optimal threshold is 0.96, with a corresponding maximum average welfare equal to 2.65. Results for age show that the welfare-maximizing threshold is found at age 42, with a corresponding average welfare of 2.85. This means that, had the policymaker selected as beneficiaries people aged 42 or older, she would have been able to increase the welfare by 1.09 thousand dollars per beneficiary per year, i.e. the difference between the unconditional effect estimated by regression adjustment (1.76) and the average welfare obtained by selecting on age (2.85).

Results for education are particularly interesting, as in this case we have an angle (corner) solution due to monotonicity. The empirical welfare maximization of this program would lead to selecting the people with the highest educational attainment in the sample (around 15 years). This is clearly unfeasible, as no one would be selected. As said above, this is however a minor issue, as we could adjust our empirical welfare maximization by requiring a certain minimum share of units to treat, so as to make the analysis feasible. Of course, adding this constraint would reduce the obtained level of welfare (which, in this specific case, is rather high and equal to 4.24). As this welfare is large, we expect that constraining on a minimum share of treated units can still deliver a larger welfare than the one obtained by randomizing the assignment to treatment.

Figure 3: Computation of the policy optimal decision boundary in the bivariate case. Program: National Supported Work Demonstration (NSWD). Data: LaLonde (1986). Target variable: real earnings in 1978. Bivariate selection variables: real earnings in 1975 and age. Average welfare = 3.995; share of treated units = 1%; optimal threshold for age = 30.5; optimal threshold for real earnings in 1975 = 10.9.

To explore this latter aspect, i.e.
to what extent introducing a constraint on the share of treated units would affect the average welfare, we consider a bivariate setting, i.e. a setting characterized by two selection variables. Before moving to this, however, it seems useful to first provide an example of a bivariate optimal treatment assignment.

Figure 3 sets out an example where we carried out an empirical welfare maximization of the NSWD jointly over age and real earnings in 1975 (with real earnings in 1978 again chosen as the target variable). The figure also shows the optimal estimated decision boundary (the curve drawn in black). Observe that, in the univariate case treated above, the decision boundary collapses to a single point. The optimal thresholds are 30.5 for age and 10.9 for real earnings in 1975. The upper-right quadrant represents the optimal treatment zone.

[Footnote: This is an estimate of the so-called Bayesian decision boundary as defined in supervised learning classification models (Gareth et al., 2013). In our case, this boundary should be close to the boundary of the upper-right quadrant; it is, however, smoother and imprecise due to the high sparseness (few observations) in this quadrant, with the largest part of the individuals placed in the lower-left quadrant.]

The bivariate analysis also provides the policymaker with a menu of optimal treatment scenarios among which she can make pondered decisions. We show this menu using age and education as selection variables within the same dataset.

Figure 4 sets out the optimal policy assignment decision boundary over age at different education thresholds. As said, this figure can be interpreted as a menu for the policymaker to choose among alternative scenarios characterized by different policy settings, entailing a trade-off between the size of the policy effect and the number of units to treat. As the educational attainment threshold increases, we observe that the average welfare increases too. The maximum of the average welfare is obtained when only one individual is treated, with an average welfare of 3.92 and an education of 14 years. Any other intermediate scenario of this menu would entail a smaller average welfare with, however, a larger number of treated units. Given a budget constraint, for example, the policymaker can pick one of these scenarios and run the policy, using this menu as a reference for an ex-ante optimal re-programming of the policy treatment assignment.

Figure 4: Computation of policy optimal decision boundaries in the bivariate case, when one of the two selection variables (age) is fixed at its optimal threshold and the threshold of the other variable (education) varies. Program: National Supported Work Demonstration (NSWD). Data: LaLonde (1986). Target variable: real earnings in 1978. Bivariate selection variables: age and educational attainment. Average welfare and share of treated units by education threshold: 3 years: 2.74, 47%; 4: 2.75, 47%; 5: 2.77, 45%; 6: 2.78, 45%; 7: 2.78, 45%; 8: 2.83, 41%; 9: 2.92, 36%; 10: 3.08, 27%; 11: 3.63, 14%; 12: 3.47, 4%; 13: 3.5, 2%; 14: 3.92, 0%.

5 Conclusions

The literature on empirical welfare maximization is growing rapidly. The large availability of datasets on programs already carried out, based on either observational data or randomized control trials, allows researchers and policymakers to design policies that increase social welfare by optimally fine-tuning treatment assignment.
Following in the footsteps of this recent literature, this paper has stressed the policymaker's perspective by proposing a practical implementation of an optimal policy assignment problem within the class of threshold-based selection rules. Straightforward to apply in practice and to implement with standard statistical software, the proposed procedure and illustrative application can guide policymakers in improving the ex-ante design of future policies; in other words, in learning from experience.

[Footnote: The choice of varying the education threshold while keeping age fixed at its optimal threshold is dictated by the monotonicity of education, as discussed above. It goes without saying that one might also do the opposite.]

References

[1] Athey S., Wager S. 2019. Efficient Policy Learning. arXiv preprint, arXiv:1702.02896.

[2] Bhattacharya D., Dupas P. 2012. Inferring Welfare Maximizing Treatment Assignment under Budget Constraints. Journal of Econometrics, 167, 1, 168–196.

[3] Cerulli G. 2015.
Econometric Evaluation of Socio-Economic Programs: Theory and Applications. Springer.

[4] Dehejia R., Wahba S. 2002. Propensity Score Matching Methods for Nonexperimental Causal Studies. Review of Economics and Statistics, 84, 1, 151–161.

[5] Dehejia R. 2005. Program Evaluation as a Decision Problem. Journal of Econometrics, 125, 1–2, 141–173.

[6] Dehejia R., Wahba S. 1999. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. Journal of the American Statistical Association, 94, 448, 1053–1062.

[7] Gareth J., Witten D., Hastie T., Tibshirani R. 2013. An Introduction to Statistical Learning. New York, Springer.

[8] Hirano K., Porter J.R. 2009. Asymptotics for Statistical Treatment Rules. Econometrica, 77, 5, 1683–1701.

[9] Kitagawa T., Tetenov A. 2018. Who Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice. Econometrica, 86, 2, 591–616.

[10] LaLonde R. 1986. Evaluating the Econometric Evaluations of Training Programs. American Economic Review, 76, 4, 604–620.

[11] Manski C.F. 2004. Statistical Treatment Rules for Heterogeneous Populations. Econometrica, 72, 4, 1221–1246.

[12] Zhang B., Tsiatis A.A., Laber E.B., Davidian M. 2012. A Robust Method for Estimating Optimal Treatment Regimes. Biometrics, 68, 4, 1010–1018.

[13] Zhao Y., Zeng D., Rush A.J., Kosorok M.R. 2012. Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association, 107, 499, 1106–1118.