Double Machine Learning based Program Evaluation under Unconfoundedness
Michael C. Knaus†
First version: March 9, 2020. This version: October 20, 2020
Abstract
This paper reviews, applies and extends recently proposed methods based on Double Machine Learning (DML) with a focus on program evaluation under unconfoundedness. DML based methods leverage flexible prediction models to adjust for confounding variables in the estimation of (i) standard average effects, (ii) different forms of heterogeneous effects, and (iii) optimal treatment assignment rules. An evaluation of multiple programs of the Swiss Active Labor Market Policy illustrates how DML based methods enable a comprehensive program evaluation. Motivated by extreme individualized treatment effect estimates of the DR-learner method, we propose the normalized DR-learner to address this issue.
Keywords:
Causal machine learning, conditional average treatment effects, optimal policy learning, individualized treatment rules, multiple treatments, DR-learner
JEL classification:
C21

∗ Financial support from the Swiss National Science Foundation (SNSF) is gratefully acknowledged. The study is part of the project "Causal Analysis with Big Data" (grant number SNSF 407540_166999) of the Swiss National Research Program "Big Data" (NRP 75). I thank Petyo Bonev, Martin Huber, Edward Kennedy, Michael Lechner, Vira Semenova, Anthony Strittmatter, Stefan Wager, and Michael Zimmert for helpful comments and suggestions. The usual disclaimer applies.
† University of St. Gallen. Michael C. Knaus is also affiliated with IZA, Bonn, [email protected].

1 Introduction
The adaptation of so-called machine learning to causal inference has been a productive area of methodological research in recent years. The resulting new methods complement the existing econometric toolbox for program evaluation along at least two dimensions (see for recent overviews Athey & Imbens, 2017, 2019; Abadie & Cattaneo, 2018). On the one hand, they provide flexible methods to estimate standard average effects. In particular, they provide a data-driven approach to variable and model selection in studies that rely on an unconfoundedness assumption for identification. On the other hand, they enable a more comprehensive evaluation by providing new methods for the flexible estimation of heterogeneous effects and of optimal treatment assignment rules.

This paper considers Double Machine Learning (DML) as a framework for a flexible and comprehensive program evaluation. The DML framework seems attractive because (i) it can be combined with a variety of standard supervised machine learning methods, (ii) it covers average effects for binary (e.g. Belloni, Chernozhukov, & Hansen, 2014; Belloni, Chernozhukov, Fernández-Val, & Hansen, 2017; Chernozhukov, Chetverikov, et al., 2018), multiple (e.g. Farrell, 2015) as well as continuous treatments (e.g. Kennedy, Ma, McHugh, & Small, 2017; Colangelo & Lee, 2019; Semenova & Chernozhukov, 2020), (iii) it naturally extends to the estimation of heterogeneous treatment effects of different forms like canonical subgroup effects, the best linear prediction of effect heterogeneity (Semenova & Chernozhukov, 2020), or nonparametric effect heterogeneity (e.g. Fan, Hsu, Lieli, & Zhang, 2019; Zimmert & Lechner, 2019; Foster & Syrgkanis, 2019; Oprescu, Syrgkanis, & Wu, 2019; Kennedy, 2020), and (iv) it can be used to estimate optimal treatment assignment rules (e.g. Dudik, Langford, & Li, 2011; Athey & Wager, 2017; Zhou, Athey, & Wager, 2018).
All these DML based methods have favorable statistical properties and allow the use of standard tools like t-tests, OLS, kernel regression or supervised machine learning for estimating causal parameters of interest after flexibly controlling for confounding.

This study starts with a review of DML based methods, then applies these methods in a standard labor economic setting, and comes back to the methods to propose a fix for a finite sample problem that occurred in the application. (Unconfoundedness is also known as exogeneity, selection on observables, ignorability, or the conditional independence assumption.) Thus, it contributes to the
steadily growing literature of causal machine learning for program evaluation in three ways. First, the review highlights that the methods for the different parameters all build on the same doubly robust score. The construction of this score might be computationally expensive because it requires the estimation of outcomes and treatment probabilities via machine learning methods. However, once constructed, the score can be reused for a variety of additional parameters of interest (see Figure 1 for a summary).

Figure 1: The doubly robust score as central building block. Predicted outcomes and predicted treatment probabilities form the doubly robust score, which feeds the estimation of average effects, heterogeneous effects (subgroup, best linear prediction, nonparametric) and optimal treatment assignment rules.

This makes DML based methods particularly attractive for researchers who want to avoid using different frameworks for different parameters, as the set of methods keeps growing that integrate machine learning in the estimation of average treatment effects (e.g. van der Laan & Rubin, 2006; Athey, Imbens, & Wager, 2018; Avagyan & Vansteelandt, 2017; Tan, 2018; Ning, Peng, & Imai, 2018), heterogeneous treatment effects (e.g. Tian, Alizadeh, Gentles, & Tibshirani, 2014; Athey & Imbens, 2016; Wager & Athey, 2018; Athey, Tibshirani, & Wager, 2019; Künzel, Sekhon, Bickel, & Yu, 2019) and optimal treatment assignment (e.g. Bansak et al., 2018; Kallus, 2018).

Second, we use DML based methods to provide a comprehensive and computationally convenient evaluation of four programs of the Swiss Active Labour Market Policy (ALMP) in a standard dataset (Huber, Lechner, & Mellace, 2017). The evaluation in this paper illustrates the potential of DML based methods for program evaluations under unconfoundedness and provides a potential blueprint for similar analyses. This adds to a small but steadily growing literature that applies causal machine learning to program evaluation in general (e.g.
Bertrand, Crépon, Marguerie, & Premand, 2017; Davis & Heller, 2017; Strittmatter, 2018; Farbmacher, Heinrich, & Spindler, 2019; Gulyas & Pytka, 2019; Knittel, 2019) and to evaluations based on unconfoundedness in particular (e.g. Knaus, 2018; Jacob, Härdle, & Lessmann, 2019; Kreif & DiazOrdaz, 2019; Cockx, Lechner, & Bollens, 2020; Knaus, Lechner, & Strittmatter, 2020a).

Third, we contribute to the methodological literature on the flexible estimation of individualized treatment effects (see for a recent overview Knaus, Lechner, & Strittmatter, 2020b) by proposing the normalized DR-learner (NDR-learner), which builds on the recent DR-learner of Kennedy (2020). The application reveals that the plain DR-learner produces a few extreme effect estimates. However, a normalization similar to the popular Hájek (1971) normalization for inverse probability weighting is shown to stabilize the estimates. The increased stability comes at the price that the NDR-learner limits the class of applicable machine learning methods to linear smoothers (e.g. Random Forests, Ridge or Post-Lasso).

Overall, we find that DML based methods provide a promising set of methods for program evaluation. The estimated average program effects are in line with the previous literature. We find that computer, vocational and language courses increase employment in the 31 months after program start, while the effects of job search trainings are mostly negative. The heterogeneity analysis reveals substantial heterogeneities by gender, nationality, previous labor market success and qualification. These are picked up by the estimated optimal assignment rules.

The paper proceeds as follows. Section 2 defines the estimands of interest and their identification under unconfoundedness. Section 3 reviews DML based methods for estimation and introduces the NDR-learner. Section 4 presents the application. Section 5 describes the implementation of the methods. Section 6 reports the results. Section 7 concludes.
Appendices A to C provide additional explanations and results. The R-package causalDML implements the applied estimators. A notebook replicates the main results.
2 Estimands of interest and their identification

2.1 Estimands of interest

We define the estimands of interest in the multiple treatment version of the potential outcomes framework (Rubin, 1974; Imbens, 2000; Lechner, 2001). Let W = {0, 1, ..., T} denote a set of programs and D_i(w) = 1(W_i = w) a binary variable indicating in which program individual i (i = 1, ..., N) is actually observed. We assume that each individual has a potential outcome Y_i(w) for all w ∈ W. Without loss of generality, the discussion below assumes that higher outcome values are desirable.

The first estimand of interest is the average potential outcome (APO), γ_w = E[Y_i(w)]. It answers the question about the average outcome if the whole population was assigned to program w. However, the more interesting question is usually to compare different programs w and w'. To this end, we take the difference of the according individual potential outcomes, Y_i(w) − Y_i(w'), and aggregate them to different estimands: First, the average treatment effect (ATE), δ_{w,w'} = E[Y_i(w) − Y_i(w')]. Second, the average treatment effect on the treated (ATET), θ_{w,w'} = E[Y_i(w) − Y_i(w') | W_i = w]. Third, the conditional average treatment effect (CATE), τ_{w,w'}(z) = E[Y_i(w) − Y_i(w') | Z_i = z], where Z_i ∈ Z is a vector of observed pre-treatment variables.

The different aggregations accommodate the notion that treatment effects might be heterogeneous. The ATE represents the average effect in the population, while the ATET shows it for the subpopulation that is actually observed in program w. Thus, the comparison of ATE and ATET can be informative about the quality of the program assignment mechanism. For example, an ATET larger than the ATE indicates that the observed program assignment is better than random.

The ATET is defined by the observed program assignment and is thus not subject to the choice of the researcher.
In contrast, the conditioning variables Z_i of the CATE are specified by the researcher to investigate potentially heterogeneous effects across the groups of individuals that are defined by different values of Z_i. Such heterogeneous effects can be indicative of underlying mechanisms. Further, CATEs characterize which groups win and which lose, and by how much, by receiving program w instead of w'. (In the canonical binary treatment setting the individual effect would be Y_i(1) − Y_i(0). We focus in this study on expectations of the individual treatment effects; DML based methods for quantile treatment effects can be found, e.g., in Belloni et al. (2017) and Kallus, Mao, and Uehara (2019). For DML based estimation with continuous treatments see, e.g., Kennedy et al. (2017), Colangelo and Lee (2019) and Semenova and Chernozhukov (2020).)

The different average effects above provide a comprehensive evaluation of programs under the current program assignment policy. In many applications, however, we want to conclude the analysis with a recommendation how the assignment policy could be improved. Let π(Z_i) be a policy that assigns individuals to programs according to their characteristics Z_i or, put more formally, a function that maps observable characteristics to a program: π : Z → W. In principle, the policy rule can be completely flexible and in the ideal world we would assign each individual to the program with the highest conditional APO, E[Y_i(w) | Z_i = z]. However, in many cases we want to restrict the set of candidate policy rules, denoted by Π, to be interpretable for the communication with decision makers or to incorporate cost or fairness constraints. Each of these candidate policy rules has a policy value function denoted by Q(π) = E[Y_i(π(Z_i))] = E[Σ_w 1(π(Z_i) = w) Y_i(w)]. Q(π) quantifies the average population outcome if policy rule π would be used to assign programs.
The estimand of interest is then the optimal policy rule π* with the highest value function among the set of candidate policy rules, or formally π* = argmax_{π ∈ Π} Q(π).

2.2 Identification under unconfoundedness

The previous section defined the estimands of interest in terms of potential outcomes. However, each individual is only observed in one program. Thus, only one potential outcome per individual is observable and the other potential outcomes remain latent. This is the fundamental problem of causal inference (Holland, 1986) and we need further assumptions to identify the estimands of interest. In this paper, we consider the unconfoundedness assumption that assumes access to a vector of pre-treatment variables X_i ∈ X containing Z_i such that the following standard assumptions hold (e.g. Imbens & Rubin, 2015):

Assumption 1
(a) Unconfoundedness: Y_i(w) ⊥⊥ W_i | X_i = x, ∀ w ∈ W and x ∈ X.
(b) Common support: 0 < P[W_i = w | X_i = x] ≡ e_w(x), ∀ w ∈ W and x ∈ X.
(c) Stable Unit Treatment Value Assumption (SUTVA): Y_i = Y_i(W_i).

The unconfoundedness assumption requires that X_i contains all confounding variables that jointly affect program assignment and the outcome. Common support states that it must be possible to observe each individual in all programs. SUTVA rules out interference.

These assumptions allow the identification of the average potential outcome (APO) conditional on confounders in three common ways:

E[Y_i(w) | X_i = x] = E[Y_i | W_i = w, X_i = x] ≡ μ(w, x)                          (1)
                    = E[ D_i(w) Y_i / e_w(x) | X_i = x ]                            (2)
                    = E[ μ(w, x) + D_i(w)(Y_i − μ(w, x)) / e_w(x) | X_i = x ],      (3)

where the term inside the expectation of Equation 3 is denoted by Γ(w, x) ≡ μ(w, x) + D_i(w)(Y_i − μ(w, x)) / e_w(x). Equation 1 shows that the conditional APO is identified as a conditional expectation of the observed outcome. Equation 2 shows that it is identified by reweighting the observed outcome with the inverse treatment probability.
Finally, Equation 3 adds the reweighted outcome residual to the conditional outcome representation of Equation 1. This seems redundant because we can check that the reweighted residual has expectation zero under unconfoundedness. However, this identification result is doubly robust in the sense that it still holds if we replace either μ(w, x) or e_w(x) in Equation 3 by arbitrary functions of x. This doubly robust structure plays a crucial role for the estimation procedures that we discuss in the next section. (Appendix A reviews the identification and double robustness of Equation 3 for completeness.)

From an identification perspective, Γ(w, x) defined in Equation 3 suffices to identify all estimands of interest stated in the previous subsection:

• APO: γ_w = E[Y_i(w)] = E[Γ(w, X_i)]
• ATE: δ_{w,w'} = E[Y_i(w) − Y_i(w')] = E[Γ(w, X_i) − Γ(w', X_i)]
• ATET: θ_{w,w'} = E[Y_i(w) − Y_i(w') | W_i = w] = E[Γ(w, X_i) − Γ(w', X_i) | W_i = w]
• CATE: τ_{w,w'}(z) = E[Y_i(w) − Y_i(w') | Z_i = z] = E[Γ(w, X_i) − Γ(w', X_i) | Z_i = z]
• Policy value: Q(π) = E[Y_i(π(Z_i))] = E[Σ_w 1(π(Z_i) = w) Γ(w, X_i)]
• Optimal policy: π* = argmax_{π ∈ Π} Q(π) = argmax_{π ∈ Π} E[Σ_w 1(π(Z_i) = w) Γ(w, X_i)]

3 Estimation

All Double Machine Learning (DML) based estimators for the estimands of interest discussed in the following build on the doubly robust scores of Robins, Rotnitzky, and Zhao (1994, 1995). In the following, capital Greek letters denote the scores corresponding to the small Greek letters used to define the estimands in Section 2.1.

The construction of the doubly robust scores requires the input of so-called nuisance parameters that are usually of secondary interest and considered as a tool to eventually obtain the parameters of interest. In our case, the two nuisance parameters are μ(w, x) = E[Y_i | W_i = w, X_i = x] and e_w(x) = P[W_i = w | X_i = x] for all w.
μ(w, x) is the conditional outcome mean for the subgroup observed in program w. e_w(x) is the conditional probability to be observed in program w, also known as the propensity score. Usually these functions are unknown and need to be estimated. Following Chernozhukov, Chetverikov, et al. (2018), they are estimated based on K-fold cross-fitting: (i) randomly divide the sample into K folds of similar size, (ii) leave out fold k and estimate models for the nuisance parameters in the remaining K − 1 folds, (iii) predict the nuisance parameters μ̂^{−k}(w, x) and ê_w^{−k}(x) in the left-out fold k, and (iv) repeat (ii) and (iii) such that each fold is left out once. This procedure avoids overfitting in the sense that no observation is used to predict its own nuisance parameters. To avoid notational clutter, we ignore the dependence on the specific fold in the following notation and refer to the cross-fitted nuisance parameters as μ̂(w, x) and ê_w(x).

The main building block of the following estimators is the doubly robust score of the APO, which replaces the true nuisance parameters in Equation 3 by their cross-fitted predictions:

Γ̂_{i,w} = μ̂(w, X_i) + D_i(w)(Y_i − μ̂(w, X_i)) / ê_w(X_i).   (4)

The ATE score for the comparison of treatments w and w' is then constructed as the difference of the respective APO scores:

Δ̂_{i,w,w'} = Γ̂_{i,w} − Γ̂_{i,w'}.   (5)

The only estimator we consider that uses the same nuisance parameters but plugs them into a different score is the ATET estimator. Although the identification result with the doubly robust APO score in the previous section holds, it is not doubly robust. However, the doubly robust score for the ATET exists and is defined as

Θ̂_{i,w,w'} = D_i(w)(Y_i − μ̂(w', X_i)) / ê_w − D_i(w') ê_w(X_i)(Y_i − μ̂(w', X_i)) / (ê_w ê_{w'}(X_i)),   (6)

where ê_w = N_w / N is the unconditional treatment probability with N_w counting the number of individuals observed in program w (see also, e.g., Farrell, 2015).
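To make the construction of the cross-fitted doubly robust score concrete, the following minimal Python sketch computes the APO scores of Equation 4 for a binary treatment. It is illustrative only (the paper's own implementation is the R package causalDML): as a stand-in for the machine learning step, the nuisance parameters are estimated by cell means and cell shares within discrete covariate cells of a single covariate x.

```python
import random

def dr_score(y_i, d_i, mu_hat, e_hat):
    """Doubly robust APO score for one observation (Equation 4)."""
    return mu_hat + d_i * (y_i - mu_hat) / e_hat

def cross_fit_dr_scores(y, w, x, K=2, seed=None):
    """Cross-fitted doubly robust APO scores for a binary treatment.
    Stand-in 'ML': mu_hat(arm, x) is the mean outcome and e_hat_arm(x)
    the treatment share within the covariate cell of x, both estimated
    on the folds that leave observation i out."""
    n = len(y)
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    folds = [idx[j::K] for j in range(K)]
    gamma = {0: [0.0] * n, 1: [0.0] * n}
    for k in range(K):
        held_out = set(folds[k])
        train = [i for i in range(n) if i not in held_out]
        for i in held_out:
            cell = [j for j in train if x[j] == x[i]]   # same covariate cell
            for arm in (0, 1):
                arm_cell = [j for j in cell if w[j] == arm]
                mu_hat = sum(y[j] for j in arm_cell) / len(arm_cell)
                e_hat = len(arm_cell) / len(cell)
                d_i = 1 if w[i] == arm else 0
                gamma[arm][i] = dr_score(y[i], d_i, mu_hat, e_hat)
    return gamma
```

The mean of `gamma[1][i] - gamma[0][i]` over the sample is then the ATE estimate based on the score of Equation 5.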
The estimation of the APOs, ATEs and ATETs boils down to taking the means of the previously defined doubly robust scores. For statistical inference, we can rely on standard one-sample t-tests. Thus, the score's mean and the variance of this mean are the point and the variance estimate of the respective estimand of interest:

• APO: γ̂_w = N^{−1} Σ_i Γ̂_{i,w} and σ̂²_{γ̂_w} = N^{−2} Σ_i (Γ̂_{i,w} − γ̂_w)²
• ATE: δ̂_{w,w'} = N^{−1} Σ_i Δ̂_{i,w,w'} and σ̂²_{δ̂_{w,w'}} = N^{−2} Σ_i (Δ̂_{i,w,w'} − δ̂_{w,w'})²
• ATET: θ̂_{w,w'} = N^{−1} Σ_i Θ̂_{i,w,w'} and σ̂²_{θ̂_{w,w'}} = N^{−2} Σ_i (Θ̂_{i,w,w'} − θ̂_{w,w'})²

Note that the estimated variances require no adjustment for the fact that we have estimated the nuisance parameters in a first step. The resulting estimators are consistent, asymptotically normal and semiparametrically efficient under the main assumption that the estimators of the cross-fitted nuisance parameters are consistent and converge sufficiently fast (Belloni et al., 2014; Farrell, 2015; Belloni et al., 2017; Chernozhukov, Chetverikov, et al., 2018). In particular, the product of the convergence rates of the outcome and propensity score estimators must be at least n^{1/2}. This allows the application of machine learning to estimate the nuisance parameters. Flexible machine learning estimators usually converge slower than the parametric rate n^{1/2}, but several are known to be able to achieve n^{1/4}, which would be sufficiently fast if both nuisance parameter estimators achieve it.

It is well known that estimators using doubly robust scores and parametric models for the nuisance parameters are doubly robust in the sense that they remain consistent if one of the parametric models is misspecified (see, e.g. Glynn & Quinn, 2009). The difference of the DML version is that it exploits what Smucler, Rotnitzky, and Robins (2019) call 'rate double robustness'.
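The mean-and-t-test logic above is a few lines of code. The following sketch (illustrative, not the paper's causalDML implementation; it uses the sample variance with the conventional N − 1 denominator) returns the point estimate and standard error for any vector of doubly robust scores:

```python
from math import sqrt
from statistics import mean

def score_mean_inference(scores):
    """Point estimate and standard error for the mean of doubly robust
    scores, as used for APO/ATE/ATET estimation (one-sample t-test)."""
    n = len(scores)
    est = mean(scores)
    var_hat = sum((s - est) ** 2 for s in scores) / (n - 1)  # sample variance
    se = sqrt(var_hat / n)                                   # s.e. of the mean
    return est, se

def ate_inference(gamma_w, gamma_w0):
    """ATE inference from two vectors of APO scores: take the mean of
    the score differences (Equation 5)."""
    return score_mean_inference([a - b for a, b in zip(gamma_w, gamma_w0)])
```

The same function applies unchanged to ATET scores, which is exactly the reuse property emphasized in the text.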
This robustness allows the estimation of the parameters of interest at the parametric rate n^{1/2} even if the nuisance parameters are estimated at slower rates using machine learning methods that do not require the specification of an actual parametric model. (Further results, regularity conditions and discussions can be found in Section 5.1 of Chernozhukov, Chetverikov, et al. (2018). For example, versions of Lasso (Belloni & Chernozhukov, 2013), Boosting (Luo & Spindler, 2016), Random Forests (Wager & Walther, 2015; Syrgkanis & Zampetakis, 2020), Neural Nets (Farrell, Liang, & Misra, 2018), forward model selection (Kozbur, 2020) or ensembles of those can be shown to achieve the required rates under conditions stated in the original papers.)

We can reuse the ATE score of Equation 5 to estimate conditional effects. In the following, we discuss estimators that exploit the fact that the conditional expectation of the score with known nuisance parameters equals the CATE: τ_{w,w'}(z) = E[Δ_{i,w,w'} | Z_i = z]. Thus, a natural way to estimate CATEs is to use the score with estimated nuisance parameters, Δ̂_{i,w,w'}, as pseudo-outcome in standard regression frameworks. (Note that this does not work for the ATET score in Equation 6; suitable adaptations are beyond the scope of this paper.)

We consider two special cases of CATEs following Knaus et al. (2020b). (i) Group average treatment effects (GATEs) provide the average effects for pre-specified, usually low-dimensional groups and are thus equivalent to the standard subgroup analysis comparing, e.g., men and women. (ii) Individualized average treatment effects (IATEs) aim for the most detailed effect heterogeneity that considers all confounders as heterogeneity variables (τ_{w,w'}(x) = E[Y_i(w) − Y_i(w') | X_i = x]). We review recently proposed estimators for these parameters in the following.
Semenova and Chernozhukov (2020) propose to use the pseudo-outcome in an OLS regression and to minimize

β̂_{w,w'} = argmin_β Σ_{i=1}^N ( Δ̂_{i,w,w'} − Z̃_i'β )²,

where Z̃_i contains the original Z_i and a constant. The resulting coefficients β̂_{w,w'} have the same interpretation as in a standard OLS model. The only difference is that instead of linearly modelling the level of an outcome, they model the level of a causal effect. Consequently, the fitted values estimate GATEs if we specify a fully saturated OLS model. Otherwise, the fitted values provide the best linear predictor (BLP) of the CATE. (Formally defined as τ̂^{ols}_{w,w'}(z) = ⟨z̃, β̂_{w,w'}⟩, where ⟨·,·⟩ denotes the inner product.) Most importantly, Semenova and Chernozhukov (2020) show that standard heteroscedasticity robust standard errors are valid and that we can again ignore the fact that the nuisance parameters are estimated and potentially converge slower than n^{1/2}.

A complementary option for few continuous Z_i is proposed by Fan et al. (2019) and Zimmert and Lechner (2019). The pseudo-outcome can also be used in nonparametric kernel regressions (KR):

τ̂^{np}_{w,w'}(z) = Σ_{i=1}^N K_h(Z_i − z) Δ̂_{i,w,w'} / Σ_{i=1}^N K_h(Z_i − z),

where K_h(·) is a suitable kernel function with bandwidth h. Fan et al. (2019) and Zimmert and Lechner (2019) show that, like in the OLS case, the uncertainty of the nuisance parameter estimation can be neglected and standard statistical inference for kernel regression applies. However, there is a price to pay for this flexibility in terms of the required speed of nuisance parameter convergence. For kernel regressions this requirement depends on the dimension of Z_i. For example, the product of the convergence rates needs to be faster than n^{1/2} for a one-dimensional continuous Z_i, and the requirement further increases with more variables.

IATEs may be estimated using the pseudo-outcome in supervised machine learning regressions.
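The BLP regression above amounts to an ordinary OLS fit with the score as dependent variable. A minimal one-regressor Python sketch (illustrative only; the paper's implementation additionally uses heteroscedasticity robust standard errors and allows multiple heterogeneity variables):

```python
from statistics import mean

def blp_cate(delta_hat, z):
    """Best linear predictor of the CATE: OLS of the pseudo-outcome
    delta_hat (the ATE score) on a single heterogeneity variable z
    plus a constant. Returns (intercept, slope)."""
    zbar, dbar = mean(z), mean(delta_hat)
    cov = sum((zi - zbar) * (di - dbar) for zi, di in zip(z, delta_hat))
    var = sum((zi - zbar) ** 2 for zi in z)
    slope = cov / var
    return dbar - slope * zbar, slope
```

For a binary z (e.g. a gender dummy), intercept and intercept-plus-slope are exactly the two GATEs of the saturated model.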
Kennedy (2020) calls the resulting class of estimators DR-learner and shows that the doubly robust structure of the ATE score results in favorable error bounds that would not be attainable by outcome regression or IPW based methods alone. We consider two variants of the DR-learner. First, we follow the logic of the previous two subsections and use the pseudo-outcome of the full sample in one supervised machine learning regression to estimate IATEs in sample.

This full sample procedure is computationally convenient but prone to overfitting. Thus, the second variant aims for out-of-sample IATE predictions for each individual in the sample. Following Algorithm 1 of Kennedy (2020), this requires a different cross-fitting scheme than the one described in Section 3.1: (i) randomly split the sample into four parts, (ii) use the first part to estimate the propensity score model, (iii) use the second part to estimate the outcome regression models, (iv) use the propensity score and outcome models to predict the nuisance parameters in the third part and construct the pseudo-outcome Δ̂_{i,w,w'} for this part, (v) regress Δ̂_{i,w,w'} on the covariates of the third part to estimate IATEs, and (vi) use the obtained model to predict IATEs in the fourth part. Each of the first three parts can play each role of steps (ii) to (v) once and the resulting three IATE models are then averaged to provide the IATE predictions of the fourth part. Finally, we can iterate such that we receive out-of-sample predictions for each fold (see Algorithm 1 in the Appendix for details). The computational downside of this procedure is that we cannot reuse the same nuisance parameter predictions as for the average estimators. (The Orthogonal Random Forest of Oprescu et al. (2019) is another estimator that is based on the pseudo-outcome idea and can be asymptotically normal under the assumption of parametric nuisance parameters. We focus in this paper on the more general DR-learner.)
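The role rotation in this cross-fitting scheme can be sketched as pure index bookkeeping. The following illustrative Python snippet (a hypothetical helper, not part of the paper or of causalDML) enumerates, for each held-out part, how the remaining three parts rotate through the roles of propensity estimation, outcome estimation and pseudo-outcome regression:

```python
def drlearner_roles(parts=(0, 1, 2, 3)):
    """Role schedule for the out-of-sample DR-learner cross-fitting:
    each part is predicted once; the other three parts rotate through
    the roles (propensity model, outcome model, pseudo-outcome
    regression), so each plays each role exactly once. Returns tuples
    (e_part, mu_part, reg_part, pred_part)."""
    schedule = []
    for pred in parts:
        rest = [p for p in parts if p != pred]
        for r in range(3):
            e_part = rest[r % 3]
            mu_part = rest[(r + 1) % 3]
            reg_part = rest[(r + 2) % 3]
            schedule.append((e_part, mu_part, reg_part, pred))
    return schedule
```

The schedule has 4 × 3 = 12 entries; the three IATE models sharing the same `pred_part` are the ones averaged for that part's predictions.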
(Statistical inference at this individualized level is challenging for continuous Z_i and impossible for high-dimensional Z_i, as discussed for example by Chernozhukov, Demirer, Duflo, and Fernandez-Val (2017).)

The DR-learner shares the problem of all estimators that involve reweighting by the inverse of the propensity score: the inverse probability weights do not sum to one in finite samples. We therefore propose to adapt the idea of Hájek (1971) and normalize the inverse probability weights to sum to one. This normalization is recommended to stabilize estimators of average effects (e.g. Imbens, 2004; Lunceford & Davidian, 2004; Robins, Sued, Lei-Gomez, & Rotnitzky, 2007; Busso, DiNardo, & McCrary, 2014). However, it could play an even bigger role in the estimation of conditional effects because finite sample imbalances are more likely to occur at the individualized level. Thus, we propose the normalized DR-learner (NDR-learner) as a stabilized complement to the DR-learner.

The NDR-learner is less flexible than the DR-learner in the sense that it requires applying linear smoothers (e.g. Buja, Hastie, & Tibshirani, 1989, and references therein) to estimate the IATEs. However, this restriction still allows the use of popular machine learning methods like tree-based methods (regression trees, Random Forests or boosted trees), Ridge or any method that runs OLS after variable selection like Post-Lasso (Belloni & Chernozhukov, 2013). Note further that the nuisance parameters can still be estimated with supervised machine learning methods that are not linear smoothers.

Linear smoothers can be represented as linear combinations of (pseudo-)outcomes. This means we know the weight α_i(x) that each individual (pseudo-)outcome receives in predicting the (pseudo-)outcome at x.
When such weights are available, the DR-learner estimated IATE can be expressed as

τ̂^{drl}_{w,w'}(x) = Σ_{i=1}^N α_i(x) Δ̂_{i,w,w'}
                  = Σ_{i=1}^N α_i(x) [μ̂(w, X_i) − μ̂(w', X_i)]
                    + Σ_{i=1}^N α_i(x) D_i(w) / ê_w(X_i) · Ỹ_i(w, X_i)
                    − Σ_{i=1}^N α_i(x) D_i(w') / ê_{w'}(X_i) · Ỹ_i(w', X_i),   (7)

where Ỹ_i(w, X_i) = Y_i − μ̂(w, X_i) denotes the individual specific outcome residual of treatment arm w, and λ^w_i(x) ≡ α_i(x) D_i(w) / ê_w(X_i) and λ^{w'}_i(x) ≡ α_i(x) D_i(w') / ê_{w'}(X_i) collect the weights on the residuals. In finite samples, the λ^w_i(x) and λ^{w'}_i(x) usually do not sum to one. This is especially problematic if they sum to something much greater than one. In this case the weighted residuals receive much more weight than the outcome regressions. This might result in implausibly large effect estimates that could even fall outside of the possible bounds of a given outcome variable (Kang & Schafer, 2007; Robins et al., 2007). (For bounded outcomes, the effects must lie in the interval [Y_min − Y_max, Y_max − Y_min], with Y_min and Y_max denoting the minimum and maximum values of the outcome, respectively.)

The NDR-learner normalizes the weights to sum to one:

τ̂^{ndrl}_{w,w'}(x) = Σ_{i=1}^N α_i(x) [μ̂(w, X_i) − μ̂(w', X_i)]
                    + ( Σ_{i=1}^N λ^w_i(x) )^{−1} Σ_{i=1}^N λ^w_i(x) Ỹ_i(w, X_i)
                    − ( Σ_{i=1}^N λ^{w'}_i(x) )^{−1} Σ_{i=1}^N λ^{w'}_i(x) Ỹ_i(w', X_i).   (8)

This is more demanding from a computational point of view because it requires calculating the weights α_i(x) and the normalization for each x of interest (Algorithm 2 provides the details of the implementation). However, the application below shows that the normalization deals well with the cases where outcome residuals receive high weights, leading to implausibly large effect estimates. Thus, the NDR-learner is an interesting alternative to the DR-learner if effect sizes become suspicious.
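The difference between Equations 7 and 8 is only the normalization of the residual weights. The following Python sketch (a hypothetical helper for a single evaluation point x, not the paper's causalDML implementation) computes both versions given smoother weights alpha_i(x) and cross-fitted nuisance predictions:

```python
def iate_at_x(alpha, d_w, d_w0, e_w, e_w0, y, mu_w, mu_w0):
    """DR-learner and NDR-learner IATE at one point x (Equations 7, 8).
    Arguments are lists over the N observations: alpha_i(x) are the
    linear smoother weights; d_w/d_w0 the indicators D_i(w), D_i(w');
    e_w/e_w0 and mu_w/mu_w0 the cross-fitted nuisance predictions."""
    reg = sum(a * (m1 - m0) for a, m1, m0 in zip(alpha, mu_w, mu_w0))
    lam_w = [a * d / e for a, d, e in zip(alpha, d_w, e_w)]      # residual weights
    lam_w0 = [a * d / e for a, d, e in zip(alpha, d_w0, e_w0)]
    res_w = [yi - m for yi, m in zip(y, mu_w)]                   # outcome residuals
    res_w0 = [yi - m for yi, m in zip(y, mu_w0)]
    drl = reg + sum(l * r for l, r in zip(lam_w, res_w)) \
              - sum(l * r for l, r in zip(lam_w0, res_w0))
    # NDR-learner: normalize each residual weight vector to sum to one
    ndrl = reg + sum(l * r for l, r in zip(lam_w, res_w)) / sum(lam_w) \
               - sum(l * r for l, r in zip(lam_w0, res_w0)) / sum(lam_w0)
    return drl, ndrl
```

With weights that already sum to one, both estimates coincide; when a small propensity score inflates a residual weight, only the DR-learner estimate is blown up, which is exactly the instability the normalization removes.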
3.4 Optimal treatment assignment

The APO score of Section 3.1 can also be reused to estimate optimal treatment assignment rules. To this end, note that the value function of any policy rule π(Z_i) can be estimated as

Q̂(π) = N^{−1} Σ_{i=1}^N Σ_{w=0}^T 1(π(Z_i) = w) Γ̂_{i,w}.

This means each individual contributes the score of the treatment that she is assigned to under this policy rule. However, we are not necessarily interested in the value function of some policy rule, but want to estimate the optimal policy rule that maximizes this value function, π̂* = argmax_{π ∈ Π} Q̂(π). This requires searching over all candidate policy rules to find the optimum as there exists no closed form solution.

Example:
Consider the case where Z_i is a binary covariate and W_i is a binary treatment. We have four different policy rules: treat nobody (π_0), treat only those with Z_i = 1 (π_1), treat only those with Z_i = 0 (π_2), or treat everybody (π_3). We illustrate this using two representative observations, i = 1 with Z_1 = 0 and i = 2 with Z_2 = 1, in Table 1. Columns three to six show the assignments under the four potential assignment rules. For example, the first observation receives no treatment under policy rules π_0 and π_1, but is treated under policy rules π_2 and π_3. To find the optimal rule, we compare the means of the APO scores in the last four columns and pick the policy rule that corresponds to the largest mean. The number of policy values to compare increases dramatically in settings with multiple treatments and Z_i being a vector of potentially non-binary variables.

Table 1: Example of DML based optimal treatment assignment

  i   Z_i   π_0   π_1   π_2   π_3   Q̂(π_0)    Q̂(π_1)    Q̂(π_2)    Q̂(π_3)
  1    0     0     0     1     1    Γ̂_{1,0}   Γ̂_{1,0}   Γ̂_{1,1}   Γ̂_{1,1}
  2    1     0     1     0     1    Γ̂_{2,0}   Γ̂_{2,1}   Γ̂_{2,0}   Γ̂_{2,1}
 ...  ...   ...   ...   ...   ...     ...        ...        ...        ...

We expect that the estimated policy in finite samples and with estimated nuisance parameters does not coincide with the true optimal policy rule. This is conceptualized as the 'regret', defined as the difference between the true and the estimated optimal value function, R(π̂*) = Q(π*) − Q(π̂*).

Zhou et al. (2018) show that the DML based procedure minimizes the maximum regret asymptotically under two main conditions: First, the same convergence conditions for the nuisance parameters that are required for ATE estimation (the product of the nuisance parameter convergence rates achieves n^{1/2}). Second, the set of candidate policy rules Π is not too complex. In particular, Zhou et al. (2018) show that decision trees with fixed depth are a suitable class of policy rules.
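For a discrete covariate and a binary treatment, the search in the example can be written as a brute-force enumeration of all deterministic rules. The following illustrative Python sketch (over an unrestricted rule class; the paper's setting restricts Π, e.g. to shallow decision trees as in Zhou et al., 2018) picks the rule with the highest estimated value:

```python
from itertools import product
from statistics import mean

def best_policy(z, gamma0, gamma1):
    """Exhaustive search over all deterministic rules pi mapping the
    levels of a discrete covariate z to {0, 1}, picking the rule with
    the highest Q_hat(pi) = mean of the assigned APO scores."""
    levels = sorted(set(z))
    best = None
    for assignment in product([0, 1], repeat=len(levels)):
        rule = dict(zip(levels, assignment))          # candidate policy pi
        q_hat = mean(g1 if rule[zi] == 1 else g0      # assigned score
                     for zi, g0, g1 in zip(z, gamma0, gamma1))
        if best is None or q_hat > best[1]:
            best = (rule, q_hat)
    return best
```

With one binary z this enumerates exactly the four rules π_0 to π_3 of Table 1; with many covariate levels the candidate set explodes, which is why restricted classes such as fixed-depth trees are used in practice.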
Again, the double robustness of the used scores results in statistical guarantees that are not achievable for methods based on outcome regressions or IPW alone.

4 Application

We use a standard dataset of the Swiss Active Labor Market Policy (ALMP) that is already the basis of previous studies (Huber et al., 2017; Lechner, 2018; Knaus et al., 2020a) to estimate the effect of different programs on employment. In particular, we start with the sample of 100,120 unemployed individuals of Huber et al. (2017), which consists of 24 to 55 year old individuals registered as unemployed in 2003. (Gerfin and Lechner (2002), Lalive, van Ours, and Zweimüller (2008) and Knaus et al. (2020a), among others, provide a more detailed description of the surrounding institutional setting. The dataset is available as restricted use file via FORSbase (ref study: 13867).) We consider non-participants and participants of four different program types: job search, vocational training, computer programs and language courses. (The dataset also contains participants of an employment program and a personality training. However, we leave them out to keep the number of obtained results manageable.) As the assignment policies differ substantially across the three language regions, we focus only on individuals living in the German speaking part and remove those in the French and Italian speaking parts to avoid common support problems.

This leaves us with 67,577 observations. We evaluate the first program participation within the first six months after the beginning of the unemployment spell. One problem of this definition is that non-participants comprise people that quickly come back into employment before they would be assigned to a training program. This could result in an overly optimistic evaluation of non-participation. We follow Lechner (1999) and Lechner
No program Job search Vocational Computer Language(1) (2) (3) (4) (5)No. of observations 47,653 11,610 858 905 1504Outcome: months employed of 31 14.7 14.4 18.4 19.2 13.5Female (binary) 0.44 0.44 0.33 0.60 0.55Age 36.61 37.31 37.45 39.08 35.28Foreigner (binary) 0.37 0.33 0.30 0.21 0.67Employability 1.93 1.98 1.93 1.97 1.85Past income in CHF 10,000 4.25 4.67 4.87 4.32 3.73
Note:
Employability is an ordered variable with one indicating low employability, two mediumemployability and three high employability. The exchange rate USD/CHF was roughly 1.3 at thattime. The full set of variables is reported in Table C.1. and Smith (2007) and assign pseudo program starting points to the non-participants andkeep only those who are still unemployed at this point. This results in a final samplesize of 62,530 observations.The outcome of interest is the cumulated number of months in employment in the 31months after program start, which is the maximum available time span in the dataset.Row one of Table 2 provides the number of observations in each group. Roughly 75%participate in no program. By far the largest program is the job search program, whichis also called basic program. The more specific programs are much smaller with roughly1000 observations each. Row two shows that the average outcomes substantially differby different groups. However, it is not clear whether this is only due to selection effectsbecause the observable characteristics are not comparable across groups, as the remainingrows show. Especially the share of females, the share of foreigners and past incomediffer quite substantially across programs. The control variables comprise 45 variablesthat are reported in Table C.1. They consist of socio-economic characteristics of theunemployed individuals, caseworker characteristics, information about the assignmentprocess, information about the previous job and regional economic indicators. The assignment of the pseudo starting point is based on estimated probabilities to start a program at aspecific time. The probability depends also on covariates and is estimated using the same random forestspecification that is discussed later in Section 5. Implementation
We estimate the nuisance parameters via Random Forest (Breiman, 2001) using the implementation with honest splitting in the grf
R-package (Athey et al., 2019) and 5-fold cross-fitting. The tuning parameters in each regression are selected by out-of-bag validation. All regressions apply the full set of control variables listed in Table C.1. We run the outcome regressions for each treatment group separately to obtain μ̂(w, x). The propensity scores are also estimated separately for each treatment, using a treatment indicator as outcome in the random forest. The propensity scores are then normalized to sum to one within an individual.

We estimate CATEs at different levels of granularity. First, we investigate GATEs for subgroups by gender, foreigner status and three categories of employability. These are regularly used in the program evaluation literature and usually investigated by re-estimating everything in the subgroups. However, this can be performed at very low computational cost after DML for average effects, using only a standard OLS regression with the pseudo-outcome as described in Section 3.3.1 and dummy variables for all groups but the reference group as covariates. Second, we estimate kernel regression CATEs for the continuous variables age and past income based on the R-package np (Hayfield & Racine, 2008). The kernel regressions apply a second-order Gaussian kernel function and use 0.9 of the cross-validated bandwidth for undersmoothing, as suggested by Zimmert and Lechner (2019). Third, we specify an OLS model in which all five previously used variables enter linearly. Finally, we go beyond the handpicked variables and estimate the IATEs using all 45 control variables in the DR-learner and the NDR-learner. Both are implemented with the honest Random Forest because the grf package allows extraction of the prediction weights α_i(x) required for the NDR-learner. We apply both variants described in Section 3.3.3: in one, we estimate the IATE for each observation using the DR- and NDR-learner in the full sample; in the other, we predict them out-of-sample.
For the latter, Appendix B provides a detailed description of the underlying DR- and NDR-learner algorithms.

Table 3: Steps of implementation

Step  Input                               Operation                             Output
1.    W_i, X_i                            Predict treatment probabilities       ê_w(x)
2.    Y_i, W_i, X_i                       Predict treatment specific outcomes   μ̂(w, x)
3.    Y_i, W_i, ê_w(x), μ̂(w, x)          Plug into Equation 4                  Γ̂_{i,w}
4.    Γ̂_{i,w}                            Mean, one-sample t-test               APOs
5.    Γ̂_{i,w}                            Take difference                       Δ̂_{i,w,w′}
6.    Δ̂_{i,w,w′}                         Mean, one-sample t-test               ATEs
7.    Δ̂_{i,w,w′}, Z_i                    Ordinary least squares                GATEs or BLP CATEs
8.    Δ̂_{i,w,w′}, Z_i                    Kernel regression                     KR CATEs
9.    Δ̂_{i,w,w′}, X_i                    Supervised machine learning           IATEs
10.   Γ̂_{i,w}, Z_i                       Optimal decision tree                 Optimal treatment rule

The optimal treatment assignment rule is estimated as decision trees of depth one, two and three. We follow Algorithm 2 for exact tree-search of Zhou et al. (2018) that is implemented in the policytree
R-package (Sverdrup, Kanodia, Zhou, Athey, & Wager, 2020). We estimate the trees first with the five handpicked variables. However, these variables include gender and foreigner status, which might be too sensitive to include in practice. Thus, we investigate another set of 16 variables that includes only the objective measures of education and labor market history of the unemployed persons that would be available to the caseworker from the administrative records.

Table 3 summarizes all required implementation steps. It highlights that a comprehensive DML based program evaluation can be run with a few lines of code in any statistical software program that is capable of the operations in the third column. Thus, researchers can build their customized analyses in a modular fashion based on established code. Alternatively, the R-package causalDML already implements the required steps, as showcased in the replication notebook accompanying this paper.
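The exact tree search of step 10 can be illustrated for depth one by brute force: for every split variable and every threshold, give each leaf the treatment with the largest summed score. The sketch below only mimics the logic of the policytree implementation referenced above, in Python with simulated data rather than the paper's R code:

```python
# Brute-force depth-1 policy tree over doubly robust scores: for each split
# variable and threshold, each leaf gets the treatment with the largest
# summed score. Data and the two-treatment setup are simulated placeholders.
import numpy as np

def depth1_tree(Z, Gamma):
    """Return (variable, threshold, leaf treatments) maximizing sum_i Gamma[i, pi(Z_i)]."""
    best_value, best_rule = -np.inf, None
    for j in range(Z.shape[1]):
        for thr in np.unique(Z[:, j])[:-1]:   # keep both leaves non-empty
            left = Z[:, j] <= thr
            w_left = Gamma[left].sum(axis=0).argmax()
            w_right = Gamma[~left].sum(axis=0).argmax()
            value = Gamma[left, w_left].sum() + Gamma[~left, w_right].sum()
            if value > best_value:
                best_value, best_rule = value, (j, thr, w_left, w_right)
    return best_rule

rng = np.random.default_rng(2)
n = 800
Z = rng.normal(size=(n, 2))                  # two candidate policy variables
Gamma = rng.normal(size=(n, 2))              # stand-in scores for w = 0, 1
Gamma[:, 1] += np.where(Z[:, 0] > 0.5, 2.0, -2.0)   # treat only if Z_0 > 0.5
j, thr, w_left, w_right = depth1_tree(Z, Gamma)
print(j, round(thr, 2), w_left, w_right)     # splits on variable 0 near 0.5
```

Enumerating every threshold is what makes the search exact; the combinatorics of deeper trees are what the specialized algorithm of Zhou et al. (2018) handles efficiently.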
We focus here on the effect estimates and discuss the nuisance parameters in Appendix C.2. Throughout this section, we compare the four programs to non-participation. Recall that the outcome of interest is the cumulated number of months employed in the 31 months after program start. Figure 2 depicts ATE and ATET estimates and shows

The underlying APOs are shown in Figure C.2 of Appendix C.

[Figure 2: Average treatment effects of job search, vocational, computer and language programs vs. non-participation; estimands: ATE and ATET]
Note:
The figure shows the point estimates of the average treatment effects of participating in the program labeled on the x-axis vs. non-participation and their 95% confidence intervals. Numeric results are in Panels B and C of Table C.5.

substantial differences in the effectiveness of the programs. The job search program decreases the months in employment on average by about one month. In contrast, other programs that teach hard skills show substantial improvements, with roughly three additional months in employment on average.

Comparing ATE and ATET shows no major differences for most programs. This suggests that there is either no effect heterogeneity correlated with observables or that the assignment does not take advantage of this heterogeneity. We would expect ATETs to be higher than ATEs if program assignment were well targeted. However, we only find evidence for the opposite: the actual participants of a language course show a 1.5 months lower treatment effect than the population. This difference suggests that there is substantial effect heterogeneity to uncover and potential to improve treatment assignment.
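Steps 4 to 6 of Table 3 reduce to means and one-sample t-tests of the score differences. A minimal sketch with simulated scores (the numbers below are placeholders loosely calibrated to the outcome scale, not the paper's data, and Python stands in for the R implementation):

```python
# ATE estimation after DML in miniature: the ATE is the mean of the
# per-observation score differences, with a one-sample t-test for inference.
# All scores are simulated placeholders.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
gamma_0 = rng.normal(loc=14.7, scale=8.0, size=n)            # no-program scores
gamma_w = gamma_0 + rng.normal(loc=-1.0, scale=4.0, size=n)  # job-search-like scores

delta = gamma_w - gamma_0                # step 5: per-observation score difference
ate = delta.mean()                       # step 6: mean ...
se = delta.std(ddof=1) / np.sqrt(n)      # ... and its standard error
t_stat = ate / se
print(round(ate, 2), round(t_stat, 1))   # an ATE of about -1 month, clearly significant
```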
This subsection studies effect heterogeneity at different levels of granularity. We start by estimating group average treatment effects (GATEs). Panel A of Table 4 shows the result of an OLS

For a better understanding of the underlying dynamics, Figure C.3 in Appendix C reports and discusses the effects of program participation on the employment probabilities over time.
Table 4: Group average treatment effects

                        Job search  Vocational  Computer  Language
                           (1)         (2)        (3)       (4)

Panel A:
Constant                 -1.27***      ***         ***       ***
                         (0.17)      (0.55)      (0.60)    (0.46)
Female                    0.57**     -1.10        2.53***  -1.67**
                         (0.25)      (0.87)      (0.86)    (0.76)

Panel B:
Constant                 -1.28***      ***         ***       ***
                         (0.16)      (0.53)      (0.50)    (0.51)
Foreigner                 0.73***      **        -0.80     -2.91***
                         (0.26)      (0.89)      (0.94)    (0.71)

Panel C:
Constant                 -0.15        5.36***      ***       ***
                         (0.33)      (1.03)      (1.09)    (0.88)
Medium employability     -0.94***    -2.28**     -2.63**   -0.17
                         (0.36)      (1.15)      (1.20)    (0.98)
High employability       -1.70***    -4.62***    -3.29*
F-statistic                ***         **          *

Note:
This table shows OLS coefficients and their heteroscedasticity robust standard errors (in parentheses) of regressions run with the pseudo-outcome defined as described in Section 3.3. * p < 0.1, ** p < 0.05, *** p < 0.01

regression with a female dummy as covariate, Δ̂_{i,w,w′} = β₀ + β₁ female_i + error_i. The constant (β₀) provides the GATE for the reference group, men, and the female coefficient (β₁) describes how much the GATE differs for women. The results show substantial gender differences in the effectiveness of programs. Women suffer significantly less from job search participation and profit significantly more from computer program participation. This gender gap in the effectiveness of ALMPs is also well-documented in the literature (Crépon & van den Berg, 2016; Card, Kluve, & Weber, 2018). In contrast to this, we find that women profit on average significantly less from language courses than men.

Panel B replaces the female dummy in the regression by a foreigner dummy. Strikingly, Swiss citizens as the reference group show a big positive effect of participating in language courses, but the effect disappears for foreigners. After adding the coefficient for foreigners to the constant, the foreigners' GATE is only 0.71 (3.62 − 2.91, standard error: 0.62). Crucial information for better understanding this finding would be which languages they learn, which is unfortunately not available in this dataset.

Panel C shows the results of a similar regression but now with two dummies indicating

[Figure 3: Effect heterogeneity regarding past income. Kernel regression CATEs along past income (0 to 75,000 CHF) for (a) Job search, (b) Vocational, (c) Computer, (d) Language]

Note:
Dotted line indicates the point estimate of the respective average treatment effect. Grey area shows the 95%-confidence interval.

medium and high employability, such that low employability becomes the reference group. The F-statistic in the last line tests the joint significance of the two dummies. It is statistically significant at least at the 10%-level for the programs in the first three columns. They all show a common gradient: individuals with low employability benefit substantially more, or at least suffer less, from program participation.
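The GATE regressions above (step 7 of Table 3) are plain OLS of the pseudo-outcome on group dummies: the constant estimates the reference-group GATE and each coefficient the difference to it. A minimal Python sketch with a simulated pseudo-outcome (the GATE values below are invented, not the paper's estimates):

```python
# GATE estimation after DML: regress the pseudo-outcome on a female dummy.
# The pseudo-outcome 'delta' is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
female = rng.integers(0, 2, n)
# simulated truth: GATE of -1.3 for men and -1.3 + 0.6 = -0.7 for women
delta = -1.3 + 0.6 * female + rng.normal(scale=5.0, size=n)

X = np.column_stack([np.ones(n), female])       # constant + female dummy
beta, *_ = np.linalg.lstsq(X, delta, rcond=None)
gate_men, gate_gap = beta                        # beta0 = men's GATE, beta1 = gap
print(round(gate_men, 2), round(gate_gap, 2))    # close to -1.3 and 0.6
```

In the paper the standard errors are heteroscedasticity robust; any standard OLS routine with robust standard errors reproduces the inference step.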
While subgroup analyses are standard in program evaluations, the estimation of kernel regression CATEs along continuous variables is rarely pursued. We estimate such CATEs along the continuous variables past income and age and find no notable heterogeneity for the latter. However, effect sizes are clearly associated with past income. Figure 3 shows

Figure C.4 in the appendix shows the corresponding results.
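The kernel regression step can be sketched with a hand-rolled Nadaraya-Watson smoother of the pseudo-outcome; the paper uses the np package in R, so the data, the effect gradient and the bandwidth below are illustrative assumptions only:

```python
# Nadaraya-Watson sketch of a kernel regression CATE along a continuous
# variable. Data, bandwidth and the declining effect are simulated assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 3000
income = rng.uniform(0, 8, n)             # past income in CHF 10,000
delta = 1.0 - 0.4 * income + rng.normal(scale=5.0, size=n)  # effect falls with income

def kr_cate(x0, x, y, bw):
    """Kernel-weighted mean of the pseudo-outcome at evaluation point x0."""
    w = np.exp(-0.5 * ((x - x0) / bw) ** 2)    # second-order Gaussian kernel
    return (w * y).sum() / w.sum()

bw = 0.9 * 0.8                                 # 0.9 x an assumed CV bandwidth
cate_low = kr_cate(1.0, income, delta, bw)     # CATE at low past income
cate_high = kr_cate(7.0, income, delta, bw)    # CATE at high past income
print(round(cate_low, 2), round(cate_high, 2))
```

Undersmoothing with 0.9 times the cross-validated bandwidth, as in the paper, keeps the smoothing bias small enough for valid confidence intervals.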
Table 5: Best linear prediction of effect heterogeneity

                           Job search  Vocational  Computer  Language
                              (1)         (2)        (3)       (4)
Constant                    -0.49        3.89       5.16**      **
                            (0.71)      (2.38)     (2.37)    (2.11)
Female                       0.21       -1.97**      **      -1.31*
                            (0.27)      (0.92)     (0.91)    (0.79)
Age                          0.03*        **          *        ***
                            (0.27)      (0.90)     (0.96)    (0.74)
Medium employability        -0.65*      -1.48      -2.31*    -0.75
                            (0.37)      (1.17)     (1.22)    (1.01)
High employability          -1.21**     -3.29**    -2.82     -0.37
                            (0.51)      (1.52)     (1.72)    (1.51)
Past income in CHF 10,000   -0.26***    -0.64***   -0.42**     *
                            (0.06)      (0.23)     (0.19)    (0.18)
F-statistic                  6.72***      ***         ***      ***

Note:
This table shows OLS coefficients and their heteroscedasticity robust standard errors (in parentheses) of regressions run with the pseudo-outcome as described in Section 3.3.1. * p < 0.1, ** p < 0.05, *** p < 0.01

that effects decrease with higher past income for all but the language programs. The latter have only a small positive effect for individuals with low past income, but it increases with higher income. One potential explanation for these findings is that the value of language skills is larger for high-skilled workers in multilingual countries like Switzerland because they reduce information costs across language borders (see, e.g., Isphording, 2014).

The CATEs considered so far were nonparametric but only univariate. Now we model the CATE by specifying a multivariate OLS regression with the previously used covariates entering linearly. It is most likely misspecified and thus estimates the best linear predictor (BLP) of the CATEs with respect to these variables. However, it provides a compact and accessible summary of the effect heterogeneities. Additionally, it holds the other included variables constant. Consider, for example, the coefficients for being female in Table 5. Compared to Table 4, the coefficients in the first three columns are smaller and the one for language courses is larger (for example, for job search it is 0.2 instead of 0.6). The reason is that it represents a partial effect that holds other variables like past income fixed. The

[Figure 4: Boxplot of out-of-sample predicted IATEs by DR-learner (DRL) and NDR-learner (NDRL) for job search, vocational, computer and language programs]
Note:
The figure shows the distribution of IATEs for participating in the program labeled on the x-axis vs. non-participation, estimated by the DR-learner (DRL) and the NDR-learner (NDRL). The dashed line indicates the possible range of the IATE of [-31, 31] to illustrate that several DR-learner estimated IATEs lie outside this bound.

subgroup female coefficient in Table 4 partly picks up that women have lower past income and that lower income is associated with higher treatment effects for all but language courses. This example illustrates that the same strategies that are usually applied to interpret an outcome OLS model can now be used to interpret the effect OLS model.
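A small numerical illustration of why the DR-learner can produce estimates outside the feasible range noted above: the pseudo-outcome divides the outcome residual by the propensity score, so a very small ê yields extreme values. The score form below is the standard AIPW expression (assumed here to match the paper's Equation 4) and all numbers are invented:

```python
# Why DR-learner IATEs can leave the feasible range [-31, 31]: the AIPW-type
# pseudo-outcome inflates the outcome residual by 1 / e_hat, so tiny
# propensity scores produce extreme values. All numbers are invented.
y, w = 31.0, 1          # a treated individual with the maximal outcome
mu1, mu0 = 15.0, 14.0   # predicted outcomes under treatment and non-treatment
e1 = 0.01               # a very small estimated propensity score

pseudo = (mu1 - mu0
          + (w == 1) * (y - mu1) / e1
          - (w == 0) * (y - mu0) / (1.0 - e1))
print(round(pseudo))    # about 1601, far outside the feasible range
```

Averaging such scores is still fine for the ATE, but a flexible learner fitted to them can chase these outliers, which is the instability the NDR-learner is designed to dampen.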
We focus on the results based on the out-of-sample variant of the DR- and NDR-learner, as the full sample variant leads to severe overfitting, with predicted IATEs ranging from -209 to 165, up to seven times larger than is possible given that the outcome is bounded between zero and 31. However, Figure 4 shows that the DR-learner produces impossible effect sizes even out-of-sample, which motivates the proposal of the NDR-learner as a stabilized variant. Figure 4 provides boxplots of the predicted IATEs and shows several substantial outliers lying below the smallest possible value of -31. However, the descriptive statistics provided in Table C.6 and the joint and marginal distributions depicted in Figure C.6 document that, besides the outliers, the distributions are quite similar and correlate

See Appendix C.5 for results and discussion of the full sample.
Table 6: Classification analysis

                                                   Job search  Vocational  Computer  Language
                                                      (1)         (2)        (3)       (4)
Previous job: unskilled worker                        0.97        0.73       0.39     -1.38
Past income                                          -1.33       -0.92      -1.13      1.03
Mother tongue other than German, French, Italian      0.65        0.71       0.05     -1.28
Qualification: some degree                           -0.85       -0.68      -0.44      1.28
Swiss citizen                                        -0.61       -0.68       0.05      1.27
Qualification: unskilled                              0.77        0.47       0.32     -1.16
Fraction of months employed last 2 years             -1.02       -0.42      -0.44      0.35
Previous job: skilled worker                         -0.76       -0.47      -0.17      1.02
Note:
Table shows the differences in means of standardized covariates between the fifth and the first quintile of the respective estimated IATE distribution.

with at least 0.87. Not surprisingly, the impact of the normalized weights is much larger for the three smaller programs and nearly negligible for the job search program. Still, we base the following discussion for all programs on the more stable results of the NDR-learner.

We conduct a classification analysis as proposed by Chernozhukov, Fernandez-Val, and Luo (2018) to understand which variables are most predictive of effect sizes. To this end, we split the predicted IATE distributions into quintiles and compare the covariate means of the observations falling into the fifth and first quintiles. For comparability, we normalize all covariates to have mean zero and variance one. Table 6 shows the eight variables that have at least one absolute difference between the highest and lowest quintile that is larger than one standard deviation. For example, we observe that the group with the highest effects of a job search program (the fifth quintile) has a 1.33 standard deviations lower past income than the lowest IATE group (the first quintile). The other variables also confirm the patterns documented in the previous subsections. The effects of job search, vocational and computer training are higher for unskilled workers with lower previous labor market success and for foreigners, while the opposite holds for language programs.

The previous section documented substantial heterogeneities in the program effects. To leverage this heterogeneity for better targeting, we apply the DML based optimal policy

Table C.7 shows the classification analysis for all variables.

[Figure 5: Estimated optimal decision trees; panels (a) depth 1 & 5 covariates, (b) depth 2 & 5 covariates, (c) depth 1 & 16 covariates, (d) depth 2 & 16 covariates]
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

algorithm of Section 3.4. Figure 5a shows the simplest decision tree with only one split for the five handpicked covariates. It would allocate men to vocational training and women to computer courses. This split is probably similar to what we would have suggested given the evidence presented in Table 4. For a tree of depth two, such an eyeballing approach reaches its limits, and the algorithmic approach provides a systematic way to arrive at an estimated optimal decision tree. The tree in Figure 5b splits first on being a foreigner and then along past income. In the absence of the possibility to split on gender, the depth-one tree in Figure 5c splits on past income at roughly the same value where the KR CATEs of computer and language training intersect in Figure 3.

Panel A of Table 7 summarizes the results of the different trees. It shows the percentage of individuals that are placed in the different programs. Not surprisingly, all individuals are recommended to be placed into one of the three positively evaluated hard-skill enhancing programs.

One yet unsolved challenge is how to draw statistical inference about the quality and stability of the decision trees. Athey and Wager (2017) propose a form of cross-validation. To this end, we use the same folds that were used in the cross-fitting procedure to estimate

Appendix C.6 also provides the trees of depth three.
Table 7: Results of the estimated policy trees

                          No program  Job search  Vocational  Computer  Language
                             (1)         (2)         (3)        (4)       (5)

Panel A: Percent allocated to program
Depth 1 & 5 variables         0           0           56         44        0
Depth 2 & 5 variables         0           0           33         47       19
Depth 3 & 5 variables         0           0           34         45       21
Depth 1 & 16 variables        0           0            0         54       46
Depth 2 & 16 variables        0           0           19         40       40
Depth 3 & 16 variables        0           0           26         38       35

Panel B: Cross-validated difference to APOs
Depth 1 & 5 variables       3.27***      ***                               *
                           (0.41)      (0.42)      (0.50)     (0.49)    (0.42)
Depth 2 & 5 variables       3.79***      ***                              ***
                           (0.41)      (0.42)      (0.42)     (0.52)    (0.46)
Depth 3 & 5 variables       3.92***      ***                              ***
                           (0.42)      (0.43)      (0.47)     (0.48)    (0.48)
Depth 1 & 16 variables      3.44***      ***                               **
                           (0.41)      (0.43)      (0.51)     (0.48)    (0.42)
Depth 2 & 16 variables      3.63***      ***                               **
                           (0.42)      (0.43)      (0.49)     (0.50)    (0.44)
Depth 3 & 16 variables      3.51***      ***                               **
                           (0.43)      (0.45)      (0.47)     (0.49)    (0.47)

Note:
Panel A shows the percentage of individuals being assigned to a specific program. Panel B shows a t-test of the difference between the cross-validated policy (standard errors in parentheses) and the APOs of the programs. * p < 0.1, ** p < 0.05, *** p < 0.01

the nuisance parameters. We build the decision tree in four folds and evaluate the value in the left-out fold. First, we inspect how often the recommendations based on these trees coincide with the full sample policy rules. Figures C.8 and C.10 show that the cross-validated trees are not identical to the full sample ones.

Zhou et al. (2018) propose another validation idea and test whether the optimal policy rules perform significantly better than sending all individuals to the same program. This is achieved by taking the difference of the APO score of the cross-validated policy rule and the APO score of program w: Δ̂^cv_{i,w}(π) = Σ_{t=0}^{T} 1{π̂^cv(Z_i) = t} Γ̂_{i,t} − Γ̂_{i,w}, where π̂^cv(Z_i) is the policy rule that is estimated without individual i. A standard t-test on the mean of Δ̂^cv_{i,w}(π) then tests whether the cross-validated policy rules are significantly better than sending everybody to the same program. Note that the cross-validated policy rules do not necessarily coincide with the trees in the full sample, and the cross-validation does not estimate the value function for that specific tree. This would require holding out a test set, which would be viable for an application with bigger programs.

The results are provided in Panel B of Table 7. We can interpret the mean of Δ̂^cv_{i,w}(π) as the average treatment effect of comparing a regime under the estimated assignment rule to a regime where everybody is sent to the same program. This effect is positive for all but one tree specification, indicating that the estimated rules can leverage the effect heterogeneities to improve the allocation.
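The mechanics of this comparison can be sketched in a few lines; the scores and the stand-in 'cross-validated' rule below are simulated, so the sketch only illustrates the t-test on Δ̂^cv, not the paper's estimates:

```python
# Cross-validated policy comparison in miniature: score of the rule minus the
# score of sending everybody to program w, tested with a one-sample t-test.
# Scores and the stand-in rule are simulated placeholders.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
Z = rng.integers(0, 2, n)                        # policy variable
Gamma = np.column_stack([                        # APO scores for programs 0, 1
    rng.normal(15.0, 6.0, n) + 2.0 * Z,          # program 0 works when Z = 1
    rng.normal(15.0, 6.0, n) + 2.0 * (1 - Z)])   # program 1 works when Z = 0
pi_cv = 1 - Z                                    # stand-in cross-validated rule

# Delta_cv: score under the rule minus score of always assigning program 1
delta_cv = Gamma[np.arange(n), pi_cv] - Gamma[:, 1]
t_stat = delta_cv.mean() / (delta_cv.std(ddof=1) / np.sqrt(n))
print(round(delta_cv.mean(), 2), round(t_stat, 1))   # positive and significant
```

Because the rule here exploits the simulated heterogeneity, the mean difference is positive and the t-test rejects; in the application, the analogous test compares the cross-fitted rule against each single-program benchmark.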
However, the cross-validated policy rules do not perform significantly better than simply sending everybody into vocational or computer programs. This would probably change if we could take costs or capacity constraints into account. However, we do not observe costs in this dataset, and the optimal decision tree algorithm is currently not capable of incorporating capacity constraints in a systematic way. We leave both extensions for future research using a more detailed database on both costs and capacity constraints.

Conclusion

This paper considers recent methodological developments based on Double Machine Learning (DML) through the lens of a standard program evaluation under unconfoundedness. DML based methods provide a convenient toolbox for a comprehensive program evaluation, as different parameters of interest can be estimated using the same framework and a combination of standard statistical software. The application to an Active Labor Market Policy evaluation shows that the methods also produce plausible results in practice. The only exception is the DR-learner, which required a modification before producing stable results for all individualized treatment effects. However, several conceptual and implementational issues remain open for investigation and refinement.

In general, we know little about how to choose the estimator for the nuisance parameters. The pool of potential machine learning algorithms and their combinations is large, and little is known, e.g., about the trade-off between high prediction performance and computation time in the causal setting. Clear recommendations for the implementation of cross-fitting are also missing. Another open question is how to deal with common support in general and for each estimand specifically. The literature on trimming rules is well developed for propensity score based methods estimating average effects. However, we are not only interested in average effects, and the propensity score is not the only nuisance parameter of DML.
It remains an open question whether the established trimming methods are also sensible in settings where common support becomes an issue.

The estimators for flexible heterogeneous treatment effects provide interesting new tools. However, it is currently not clear to what extent we can actually explore heterogeneity or to what extent we need to pre-define the heterogeneity of interest. The possibility to summarize pre-defined heterogeneity of interest using OLS or kernel regressions provides clearly valuable and easy-to-use options in applications. The instability of methods that aim for individualized heterogeneous effects shows that they should be used with caution, and more research is required to investigate whether adjustments like the proposed NDR-learner are useful beyond the application of this paper.

The estimation of optimal treatment assignment rules is mostly unexplored in practice and raises many interesting issues in applications regarding inference, the implementation of different constraints, more flexible rules than decision trees, and the choice of variables that could or should enter the set of policy variables, all of which could be explored in future research.

The investigation of these DML specific questions, but also the comparison with other, more specialized causal machine learning methods for each estimand, provides another interesting direction for future research. Such evidence would help to understand and guide which choices are critical in applications similar to the one in this paper.

References
Abadie, A., & Cattaneo, M. D. (2018). Econometric methods for program evaluation.
Annual Review of Economics, 465–503.
Athey, S., & Imbens, G. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 685–725.
Athey, S., & Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, (27), 7353–7360.
Athey, S., & Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, (2), 3–32.
Athey, S., Imbens, G. W., & Wager, S. (2018). Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (4), 597–632.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, (2), 1148–1178.
Athey, S., & Wager, S. (2017). Efficient policy learning.
Retrieved from https://arxiv.org/abs/1606.02647
Avagyan, V., & Vansteelandt, S. (2017).
Honest data-adaptive inference for the average treatment effect under model misspecification using penalised bias-reduced double-robust estimation.
Retrieved from http://arxiv.org/abs/1708.03787
Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., & Weinstein, J. (2018). Improving refugee integration through data-driven algorithmic assignment.
Science, (6373), 325–329.
Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, (2), 521–547.
Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, (1), 233–298.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, (2), 608–650.
Bertrand, M., Crépon, B., Marguerie, A., & Premand, P. (2017). Contemporaneous and post-program impacts of a public works program: Evidence from Côte d'Ivoire. World Bank Working Paper.
Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.
Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear smoothers and additive models. The Annals of Statistics, (2), 453–510.
Busso, M., DiNardo, J., & McCrary, J. (2014). New evidence on the finite sample properties of propensity score reweighting and matching estimators. Review of Economics and Statistics, (5), 885–897.
Card, D., Kluve, J., & Weber, A. (2018). What works? A meta analysis of recent active labor market program evaluations. Journal of the European Economic Association, (3), 894–931.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, (1), C1–C68.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2017). Generic machine learning inference on heterogenous treatment effects in randomized experiments.
Retrieved from http://arxiv.org/abs/1712.04802
Chernozhukov, V., Fernandez-Val, I., & Luo, Y. (2018). The sorted effects method: Discovering heterogeneous effects beyond their averages.
Econometrica, (6), 1911–1938.
Cockx, B., Lechner, M., & Bollens, J. (2020). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. CEPR Discussion Paper No. DP14270.
Colangelo, K., & Lee, Y.-Y. (2019). Double debiased machine learning nonparametric inference with continuous treatments. cemmap working paper CWP72/19.
Crépon, B., & van den Berg, G. J. (2016). Active labor market policies. Annual Review of Economics, 521–546.
Davis, J. M., & Heller, S. B. (2017). Using causal forests to predict treatment heterogeneity: An application to summer jobs. American Economic Review, (5), 546–550.
Dudik, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning.
Retrieved from http://arxiv.org/abs/1103.4601
Fan, Q., Hsu, Y.-C., Lieli, R. P., & Zhang, Y. (2019).
Estimation of conditional average treatment effects with high-dimensional data.
Retrieved from http://arxiv.org/abs/1908.02399
Farbmacher, H., Heinrich, K., & Spindler, M. (2019).
Heterogeneous Effects of Poverty on Cognition.
MEA Discussion Paper No. 06-2019.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, (1), 1–23.
Farrell, M. H., Liang, T., & Misra, S. (2018). Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands.
Retrieved from http://arxiv.org/abs/1809.09953
Foster, D. J., & Syrgkanis, V. (2019).
Orthogonal statistical learning.
Retrieved from http://arxiv.org/abs/1901.09036
Gerfin, M., & Lechner, M. (2002). A microeconometric evaluation of the active labour market policy in Switzerland. Economic Journal, (482), 854–893.
Glynn, A. N., & Quinn, K. M. (2009). An introduction to the augmented inverse propensity weighted estimator. Political Analysis, (1), 36–56.
Gulyas, A., & Pytka, K. (2019). Understanding the sources of earnings losses after job displacement: A machine-learning approach. Discussion Paper Series – CRC TR 224 No. 131.
Hájek, J. (1971). Comment on "An essay on the logical foundations of survey sampling, part one". In V. P. Godambe & D. A. Sprott (Eds.), Foundations of statistical inference (p. 236). Toronto: Holt, Rinehart and Winston.
Hayfield, T., & Racine, J. S. (2008). Nonparametric econometrics: The np package. Journal of Statistical Software, (5).
Hirano, K., & Porter, J. R. (2009). Asymptotics for statistical treatment rules. Econometrica, (5), 1683–1701.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, (396), 945–960.
Huber, M., Lechner, M., & Mellace, G. (2017). Why do tougher caseworkers increase employment? The role of program assignment as a causal mechanism. Review of Economics and Statistics, (1), 180–183.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, (3), 706–710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, (1), 4–29.
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Isphording, I. E. (2014).
Language and labor market success (No. 8572). IZA Discussion Papers.
Jacob, D., Härdle, W. K., & Lessmann, S. (2019). Group Average Treatment Effects for Observational Studies.
Retrieved from http://arxiv.org/abs/1911.02688
Kallus, N. (2018). Balanced policy evaluation and learning. In
Advances in neural information processing systems (pp. 8895–8906).
Kallus, N., Mao, X., & Uehara, M. (2019). Localized Debiased Machine Learning: Efficient Estimation of Quantile Treatment Effects, Conditional Value at Risk, and Beyond.
Retrieved from http://arxiv.org/abs/1912.12945
Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparisonof alternative strategies for estimating a population mean from incomplete data.
Statistical Science , (4), 523–539.Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects.
Retrieved from http://arxiv.org/abs/2004.14497
Kennedy, E. H., Ma, Z., McHugh, M. D., & Small, D. S. (2017). Non-parametric methodsfor doubly robust estimation of continuous treatment effects.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , , 1229–1245.Kitagawa, T., & Tetenov, A. (2018). Who should be treated? Empirical welfare maxi-mization methods for treatment choice. Econometrica , (2), 591–616.Knaus, M. C. (2018). A double machine learning approach to estimate the effects of musicalpractice on student’s skills.
Retrieved from https://arxiv.org/abs/1805.10300
Journal of HumanResources , , published ahead of print 26 March 2020. doi: 10.3368/jhr.57.2.0718-9615R1Knaus, M. C., Lechner, M., & Strittmatter, A. (2020b). Machine Learning Estimation ofHeterogeneous Causal Effects: Empirical Monte Carlo Evidence. The EconometricsJournal , utaa014 , published ahead of print 06 June 2020. doi: 10.1093/ectj/utaa014Knittel, C. R. (2019). Using machine learning to target treatment: The case of householdenergy use.
NBER Working Paper No. 26531.Kozbur, D. (2020). Analysis of testing-based forward model selection.
Econometrica , (5), 2147–2173.Kreif, N., & DiazOrdaz, K. (2019). Machine learning in policy evaluation: new tools forcausal inference.
Retrieved from http://arxiv.org/abs/1903.00402
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimatingheterogeneous treatment effects using machine learning.
Proceedings of the NationalAcademy of Sciences , (10), 4156–4165.Lalive, R., van Ours, J., & Zweimüller, J. (2008). The impact of active labor marketprograms on the duration of unemployment. Economic Journal , (525), 235–257.Lechner, M. (1999). Earnings and employment effects of continuous gff-the-job training ineast germany after unification. Journal of Business & Economic Statistics , (1),74–90.Lechner, M. (2001). Identification and estimation of causal effects of multiple treatmentsunder the conditional independence assumption. In M. Lechner & E. Pfeiffer (Eds.), Econometric evaluation of labour market policies (pp. 43–58). Heidelberg: Physica.Lechner, M. (2018).
Modified causal forests for estimating heterogeneous causal effects.
Retrieved from https://arxiv.org/abs/1812.09487
Lechner, M., & Smith, J. (2007). What is the value added by caseworkers?
LabourEconomics , (2), 135–151.Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensityscore in estimation of causal treatment effects: A comparative study. Statistics in edicine , (19), 2937–2960.Luo, Y., & Spindler, M. (2016). High-dimensional L2-boosting: Rate of Convergence.
Retrieved from http://arxiv.org/abs/1602.08927
Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations.
Econo-metrica , (4), 1221–1246.Ning, Y., Peng, S., & Imai, K. (2018). Robust estimation of causal effects viahigh-dimensional covariate balancing propensity score.
Retrieved from http://arxiv.org/abs/1812.08683
Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal random forest for causal infer-ence. , ,8655–8696.Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficientswhen some regressors are not always observed. Journal of the American StatisticalAssociation , (427), 846–866.Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1995). Analysis of semiparametric regressionmodels for repeated outcomes in the presence of missing data. Journal of theAmerican Statistical Association , (429), 106–121.Robins, J. M., Sued, M., Lei-Gomez, Q., & Rotnitzky, A. (2007). Comment: Performanceof double-robust estimators when "inverse probability" weights are highly variable. Statistical Science , (4), 544–559.Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonran-domized studies. Journal of Educational Psychology , (5), 688–701.Semenova, V., & Chernozhukov, V. (2020). Debiased machine learning of conditionalaverage treatment effects and other causal functions. The Econometrics Journal , utaa027 , published ahead of print 29 August 2020. doi: https://doi.org/10.1093/ectj/utaa027Smucler, E., Rotnitzky, A., & Robins, J. M. (2019). A unifying approach for doubly-robustL1 regularized estimation of causal contrasts.
Retrieved from http://arxiv.org/abs/1904.03737
Stoye, J. (2009). Minimax regret treatment choice with finite samples.
Journal of conometrics , (1), 70–81.Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validityof experiments. Journal of Econometrics , (1), 138–156.Strittmatter, A. (2018). What is the value added by using causal machine learningmethods in a welfare experiment evaluation?
Retrieved from http://arxiv.org/abs/1812.06533
Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2020). policytree: Policylearning via doubly robust empirical welfare maximization over trees.
Journal ofOpen Source Software , (50), 2232.Syrgkanis, V., & Zampetakis, M. (2020). Estimation and inference with trees and forestsin high dimensions.
Retrieved from http://arxiv.org/abs/2007.03210
Tan, Z. (2018).
Model-assisted inference for treatment effects using regularized calibratedestimation with high-dimensional data.
Retrieved from http://arxiv.org/abs/1801.09817
Tian, L., Alizadeh, A. A., Gentles, A. J., & Tibshirani, R. (2014). A simple methodfor estimating interactions between a treatment and a large number of covariates.
Journal of the American Statistical Association , (508), 1517–1532.van der Laan, M. J., & Rubin, D. (2006). Targeted maximum likelihood learning. International Journal of Biostatistics , (1).Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effectsusing random forests. Journal of the American Statistical Association , (523),1228–1242.Wager, S., & Walther, G. (2015). Adaptive concentration of regression trees, with applicationto random forests.
Retrieved from http://arxiv.org/abs/1503.06388
Wunsch, C. (2016). How to minimize lock-in effects of programs for unemployed workers.
IZA World of Labor .Zhou, Z., Athey, S., & Wager, S. (2018).
Offline multi-action policy learning: Generalizationand optimization.
Retrieved from http://arxiv.org/abs/1810.04778
Zimmert, M., & Lechner, M. (2019).
Nonparametric estimation of causal heterogeneityunder high-dimensional confounding.
Retrieved from http://arxiv.org/abs/1908 ppendices A Doubly robust identification
To revisit identification and the double robustness of Equation 3 under Assumption 1, rewrite the conditional average potential outcome in the following way, where $\mu_w(x) = E[Y_i(w) \mid X_i = x]$, $\mu(w, x) = E[Y_i \mid W_i = w, X_i = x]$ and $e_w(x) = E[D_i(w) \mid X_i = x] = P[W_i = w \mid X_i = x]$:
\begin{align*}
\mu_w(x) &= E\left[\mu(w, x) + \frac{D_i(w)(Y_i - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E\left[Y_i(w) - Y_i(w) + \mu(w, x) + \frac{D_i(w)(Y_i - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E\left[Y_i(w) - Y_i(w) + \mu(w, x) + \frac{D_i(w)(Y_i(w) - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E[Y_i(w) \mid X_i = x] + E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right] \\
&= \mu_w(x) + E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right] \tag{9}
\end{align*}
The conditional average potential outcome is thus identified if the second part of Equation 9 equals zero. This happens under three scenarios:

1. Correct propensity score and correct outcome regression:
\begin{align*}
E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - \mu(w, x)) \mid X_i = x]\, E\left[\frac{D_i(w) - e_w(x)}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - \mu(w, x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - e_w(x)}{e_w(x)}\right) \\
&= \left(\mu_w(x) - \mu(w, x)\right)\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right) \\
&= \underbrace{\left(\mu_w(x) - \mu_w(x)\right)}_{=0}\,\underbrace{\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right)}_{=0} = 0
\end{align*}

2. Correct propensity score but, instead of the correct outcome regression $\mu(w, x)$, some function $g(x)$:
\begin{align*}
E\left[(Y_i(w) - g(x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - g(x)) \mid X_i = x]\, E\left[\frac{D_i(w) - e_w(x)}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - g(x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - e_w(x)}{e_w(x)}\right) \\
&= \left(\mu_w(x) - g(x)\right)\underbrace{\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right)}_{=0} = 0
\end{align*}

3. Correct outcome regression but, instead of the correct propensity score $e_w(x)$, some function $h(x)$:
\begin{align*}
E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - h(x)}{h(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - \mu(w, x)) \mid X_i = x]\, E\left[\frac{D_i(w) - h(x)}{h(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - \mu(w, x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - h(x)}{h(x)}\right) \\
&= \left(\mu_w(x) - \mu(w, x)\right)\left(\frac{e_w(x) - h(x)}{h(x)}\right) \\
&= \underbrace{\left(\mu_w(x) - \mu_w(x)\right)}_{=0}\left(\frac{e_w(x) - h(x)}{h(x)}\right) = 0
\end{align*}

B DR- and NDR-learner
This appendix describes the algorithms that are applied to estimate out-of-sample IATEs using the DR- and NDR-learner. It mostly follows Algorithm 1 of Kennedy (2020) and adapts it to the setting in which we are interested: estimating IATEs for all observations without using them in the estimation step.
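Both algorithms below build on the same AIPW pseudo-outcome. As a rough, self-contained illustration of the mechanics (cross-fitted nuisance models, pseudo-outcome construction, second-stage regression), consider the following sketch for a binary comparison of treatment 1 vs. 0. It uses simple linear nuisance models and a two-fold split instead of the four-sample scheme of the algorithms; the function names and the simulated data are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def lin_fit(X, y):
    """Least-squares fit of y on (1, X); returns the coefficient vector."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


def lin_pred(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta


def dr_pseudo_outcome(X, W, Y, train, test):
    """AIPW pseudo-outcome on `test`, nuisances fit on `train` (linear sketch)."""
    # Propensity score: linear probability model, clipped away from 0 and 1.
    e_hat = np.clip(lin_pred(lin_fit(X[train], W[train]), X[test]), 0.01, 0.99)
    # Outcome regressions fit separately within each treatment arm.
    mu1 = lin_pred(lin_fit(X[train][W[train] == 1], Y[train][W[train] == 1]), X[test])
    mu0 = lin_pred(lin_fit(X[train][W[train] == 0], Y[train][W[train] == 0]), X[test])
    Wt, Yt = W[test], Y[test]
    return (mu1 - mu0
            + Wt * (Yt - mu1) / e_hat
            - (1 - Wt) * (Yt - mu0) / (1 - e_hat))


# Simulated data with IATE tau(x) = 1 + x, hence ATE = 1.5 for x ~ U(0, 1).
n = 20_000
X = rng.uniform(size=n)
e = 0.3 + 0.4 * X                      # true propensity score
W = rng.binomial(1, e)
Y = 2 * X + (1 + X) * W + rng.normal(scale=0.5, size=n)

# Two-fold cross-fitting: pseudo-outcomes are always predicted out-of-fold.
idx = rng.permutation(n)
folds = (idx[: n // 2], idx[n // 2:])
delta = np.empty(n)
for train, test in ((folds[0], folds[1]), (folds[1], folds[0])):
    delta[test] = dr_pseudo_outcome(X, W, Y, train, test)

ate = delta.mean()                     # averaging recovers the ATE estimate
beta = lin_fit(X, delta)               # second stage: regress Delta on X -> IATEs
iate_at = lambda x: beta[0] + beta[1] * x
```

Averaging the pseudo-outcomes recovers the DML average effect, while the second-stage regression yields (here linear) IATE predictions; the algorithms below replace the linear second stage by a flexible learner and add the four-sample rotation.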
Algorithm 1 (DR-learner)
Let $(S_1^N, S_2^N, S_3^N, S_4^N)$ denote four independent samples of $N$ observations of $O_i = (X_i, W_i, Y_i)$.

Step 1. Nuisance training: (a) Construct a model $\hat{e}_w(x)$ of the propensity score $e_w(x)$ using $S_1^N$. (b) Construct a model $(\hat{\mu}(w, x), \hat{\mu}(w', x))$ of the regression functions $(\mu(w, x), \mu(w', x))$ using $S_2^N$.

Step 2. Pseudo-outcome regression: Construct the pseudo-outcome for every observation $i$ in subsample $S_3^N$ using the models of step 1,
\[
\hat{\Delta}_{i,w,w'} = \hat{\mu}(w, X_i) - \hat{\mu}(w', X_i) + \frac{D_i(w)}{\hat{e}_w(X_i)} \tilde{Y}_i(w, X_i) - \frac{D_i(w')}{\hat{e}_{w'}(X_i)} \tilde{Y}_i(w', X_i),
\]
regress it on the covariates $X_i$ in $S_3^N$, and use the model to predict the IATEs in $S_4^N$, $\hat{\tau}_{1,w,w'}(x)$.

Step 3. Cross-fitting: Repeat steps 1–2 twice, first using $S_2^N$ for the propensity score, $S_3^N$ for the outcome regression and $S_1^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{2,w,w'}(x)$, and then using $S_3^N$ for the propensity score, $S_1^N$ for the outcome regression and $S_2^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{3,w,w'}(x)$.

Step 4. Prediction: Predict the IATEs in $S_4^N$ as the average of the three predictions, $\hat{\tau}^{drl}_{w,w'}(x) = \frac{1}{3}\hat{\tau}_{1,w,w'}(x) + \frac{1}{3}\hat{\tau}_{2,w,w'}(x) + \frac{1}{3}\hat{\tau}_{3,w,w'}(x)$.

Step 5. Iteration: Repeat steps 1–4 three times, first with $S_2^N$, $S_3^N$ and $S_4^N$ to predict the IATEs for $S_1^N$, second with $S_3^N$, $S_4^N$ and $S_1^N$ to predict the IATEs for $S_2^N$, and finally with $S_4^N$, $S_1^N$ and $S_2^N$ to predict the IATEs for $S_3^N$.

Algorithm 2 (NDR-learner)
Let $(S_1^N, S_2^N, S_3^N, S_4^N)$ denote four independent samples of $N$ observations of $O_i = (X_i, W_i, Y_i)$.

Step 1. Nuisance training: (a) Construct a model $\hat{e}_w(x)$ of the propensity score $e_w(x)$ using $S_1^N$. (b) Construct a model $(\hat{\mu}(w, x), \hat{\mu}(w', x))$ of the regression functions $(\mu(w, x), \mu(w', x))$ using $S_2^N$.

Step 2a. Pseudo-outcome regression: Construct the pseudo-outcome for every observation $i$ in subsample $S_3^N$ using the models of step 1,
\[
\hat{\Delta}_{i,w,w'} = \hat{\mu}(w, X_i) - \hat{\mu}(w', X_i) + \frac{D_i(w)}{\hat{e}_w(X_i)} \tilde{Y}_i(w, X_i) - \frac{D_i(w')}{\hat{e}_{w'}(X_i)} \tilde{Y}_i(w', X_i),
\]
regress it on the covariates $X_i$ in $S_3^N$, and use the model to predict the IATEs in $S_4^N$.

Step 2b. Normalization: For every observation $j$ in $S_4^N$, (i) extract the smoothing weights $\alpha_i(X_j)$ underlying its prediction and (ii) use them to calculate the normalized DR-learner given in Equation 8, where the sum runs over the observations in $S_3^N$, to obtain $\hat{\tau}_{1,w,w'}(X_j)$.

Step 3. Cross-fitting: Repeat steps 1–2 twice, first using $S_2^N$ for the propensity score, $S_3^N$ for the outcome regression and $S_1^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{2,w,w'}(x)$, and then using $S_3^N$ for the propensity score, $S_1^N$ for the outcome regression and $S_2^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{3,w,w'}(x)$.

Step 4. Prediction: Predict the IATEs in $S_4^N$ as the average of the three predictions, $\hat{\tau}^{ndr}_{w,w'}(x) = \frac{1}{3}\hat{\tau}_{1,w,w'}(x) + \frac{1}{3}\hat{\tau}_{2,w,w'}(x) + \frac{1}{3}\hat{\tau}_{3,w,w'}(x)$.

Step 5. Iteration: Repeat steps 1–4 three times, first with $S_2^N$, $S_3^N$ and $S_4^N$ to predict the IATEs for $S_1^N$, second with $S_3^N$, $S_4^N$ and $S_1^N$ to predict the IATEs for $S_2^N$, and finally with $S_4^N$, $S_1^N$ and $S_2^N$ to predict the IATEs for $S_3^N$.

C Results
C.1 Descriptives
Table C.1 provides the means of all control variables by program participation. It documents that measures of past labor market success, such as past income, are especially associated with program participation.

Table C.1: Means of control variables by program
No JS Voc Comp Lang
(1) (2) (3) (4) (5)
Age 36.6 37.3 37.5 39.1 35.3
Mother tongue in canton's language 0.10 0.12 0.11 0.11 0.04
Lives in big city 0.19 0.19 0.21 0.11 0.23
Lives in medium city 0.12 0.13 0.12 0.15 0.15
Lives in no city 0.68 0.68 0.67 0.73 0.63
Caseworker age 44.1 44.1 44.8 44.6 44.6
Caseworker cooperative 0.48 0.50 0.41 0.42 0.45
Caseworker education: above vocational training 0.45 0.45 0.44 0.48 0.48
Caseworker education: tertiary track 0.19 0.21 0.17 0.16 0.21
Caseworker female 0.43 0.47 0.39 0.44 0.47
Missing caseworker characteristics 0.05 0.05 0.04 0.05 0.05
Caseworker has own unemployment experience 0.62 0.63 0.64 0.61 0.63
Caseworker tenure 5.48 5.44 5.73 5.83 5.61
Caseworker education: vocational degree 0.26 0.27 0.22 0.25 0.22
Fraction of months employed last 2 years 0.81 0.84 0.83 0.84 0.72
Number of employment spells last 5 years 1.21 0.97 0.93 0.86 0.78
Employability 1.93 1.98 1.93 1.97 1.85
Female 0.44 0.44 0.33 0.60 0.55
Foreigner with temporary permit 0.13 0.11 0.12 0.04 0.44
Foreigner with permanent permit 0.23 0.22 0.18 0.17 0.23
Cantonal GDP p.c. 0.52 0.53 0.51 0.53 0.54
Married 0.47 0.46 0.48 0.45 0.72
Mother tongue other than German, French, Italian 0.33 0.29 0.31 0.18 0.64
Past income 42527.9 46693.1 48653.8 43212.8 37300.5
Previous job: manager 0.08 0.08 0.10 0.09 0.07
Missing sector 0.18 0.15 0.15 0.16 0.29
Previous job in primary sector 0.09 0.06 0.09 0.05 0.05
Previous job in secondary sector 0.12 0.14 0.15 0.13 0.12
Previous job in tertiary sector 0.61 0.65 0.61 0.67 0.54
Previous job: self-employed 0.01 0.00 0.00 0.00 0.00
Previous job: skilled worker 0.60 0.65 0.65 0.75 0.43
Previous job: unskilled worker 0.29 0.24 0.22 0.15 0.48
Qualification: semiskilled 0.16 0.14 0.17 0.14 0.15
Qualification: some degree 0.58 0.62 0.63 0.72 0.38
Qualification: unskilled 0.23 0.20 0.17 0.12 0.40
Qualification: skilled without degree 0.03 0.03 0.02 0.02 0.07
Swiss citizen 0.63 0.67 0.70 0.79 0.34
Allocation of unemployed to caseworkers: by industry 0.60 0.67 0.58 0.51 0.64
Allocation of unemployed to caseworkers: by occupation 0.51 0.57 0.46 0.45 0.57
Allocation of unemployed to caseworkers: by age 0.04 0.04 0.04 0.06 0.05
Allocation of unemployed to caseworkers: by employability 0.09 0.07 0.10 0.08 0.06
Allocation of unemployed to caseworkers: by region 0.13 0.09 0.09 0.13 0.11
Allocation of unemployed to caseworkers: other 0.09 0.07 0.08 0.10 0.09
Number of unemployment spells last 2 years 0.57 0.39 0.52 0.37 0.43
Cantonal unemployment rate (in %) 3.52 3.59 3.41 3.36 3.63
Note:
Program specific means.

C.2 Nuisance parameters

Nuisance parameters are only a tool to remove confounding, but it is still informative to investigate which variables are most predictive of treatment probabilities and outcomes. This is less straightforward for flexible tools like random forests than for the well-known regression outputs of parametric models. We therefore conduct a classification analysis as proposed by Chernozhukov, Fernandez-Val, and Luo (2018). To this end, we split the predicted nuisance parameter distributions into quintiles and compare the covariate means of the observations falling into the fifth and the first quintile. For comparability, we normalize all covariates to have mean zero and variance one and order the variables by their largest absolute difference between the highest and lowest quintile.

Table C.2 shows that measures of citizenship, qualification and previous labor market success are important predictors of program selection. In line with intuition, citizenship in particular seems to drive a large part of the selection into language courses. Table C.3, which shows the classification analysis for the outcome predictions, also displays intuitive patterns. Again, measures of citizenship, qualification and previous labor market success seem predictive of future employment, with the suggested correlations pointing in the expected directions. For example, Swiss citizens, individuals with a degree and individuals with high past income are overrepresented in the upper quintile, while individuals with a non-Swiss mother tongue and no qualification are underrepresented in the upper quintile.

Finally, we investigate the propensity score distributions for all programs. Figure C.1 shows that propensity scores are quite variable. This indicates that selection into programs is not negligible. Further, Table C.4 shows that some of the propensity scores become quite small, the smallest one being 0.003 for a computer training participant. This is not surprising given that the unconditional participation probabilities for computer and vocational training are only about 0.015. However, a small propensity score per se is not an indicator of poor overlap. The observation with the smallest propensity score receives a weight of ∼ 1/0.003 = 333, which is only 0.5% of the total weights. Note that we could easily increase the smallest propensity score by randomly removing a large fraction of non-participants and participants of the job search program. This would discard valuable information and shows that a mere focus on the smallest propensity score can be misleading in cases with imbalanced treatment group sizes. More importantly, we observe overlap in the sense that all treatment groups contain individuals with similarly low propensity scores. Thus, overlap does not seem to be a major issue in our application, at least for the low-dimensional parameters of interest.

Table C.2: Classification analysis of propensity scores
No program Job search Vocational Computer Language
(1) (2) (3) (4) (5)
Foreigner with temporary permit -0.24 -0.33 -0.32 -1.24 1.95
Swiss citizen 0.09 0.21 0.43 1.60 -1.86
Mother tongue other than German, French, Italian 0.11 -0.43 -0.33 -1.58 1.76
Previous job: unskilled worker 0.32 -0.44 -0.60 -1.52 0.99
Past income -0.71 0.77 1.40 0.28 -0.06
Previous job: skilled worker -0.21 0.32 0.30 1.32 -0.90
Qualification: some degree -0.19 0.31 0.51 1.26 -0.93
Qualification: unskilled 0.05 -0.16 -0.55 -1.15 0.86
Married -0.23 -0.02 -0.17 -0.60 1.09
Female -0.07 -0.06 -1.04 0.99 0.42
Cantonal unemployment rate (in %) -0.71 0.74 -0.91 -0.55 -0.13
Foreigner with permanent permit 0.09 0.03 -0.24 -0.82 0.55
Age -0.34 0.34 0.36 0.81 -0.20
Cantonal GDP p.c. -0.64 0.67 -0.75 -0.17 -0.11
Number of employment spells last 5 years 0.74 -0.54 -0.39 -0.46 -0.34
Allocation of unemployed to caseworkers: by occupation -0.47 0.47 -0.64 -0.20 0.08
Allocation of unemployed to caseworkers: by region 0.53 -0.53 0.02 0.05 0.02
Fraction of months employed last 2 years -0.27 0.47 0.52 0.41 -0.46
Allocation of unemployed to caseworkers: by industry -0.43 0.44 -0.48 -0.52 0.01
Previous job: manager -0.20 0.18 0.51 0.18 0.10
Employability -0.44 0.49 0.14 0.34 -0.24
Previous job in tertiary sector -0.20 0.27 0.01 0.45 -0.31
Missing sector 0.16 -0.29 -0.41 -0.34 0.43
Lives in big city 0.08 -0.08 -0.02 -0.43 0.10
Caseworker cooperative -0.06 0.10 -0.43 -0.22 -0.02
Caseworker female -0.29 0.27 -0.41 0.15 -0.02
Number of unemployment spells last 2 years 0.39 -0.33 -0.04 -0.39 0.02
Qualification: skilled without degree -0.08 -0.02 -0.11 -0.27 0.36
Lives in no city -0.01 0.04 -0.02 0.33 -0.12
Previous job in primary sector 0.30 -0.26 0.30 -0.27 -0.03
Qualification: semiskilled 0.24 -0.23 -0.01 -0.24 0.10
Caseworker tenure -0.03 0.02 0.22 0.11 0.05
Previous job in secondary sector -0.14 0.16 0.21 -0.05 -0.02
Allocation of unemployed to caseworkers: by employability 0.19 -0.19 0.11 -0.00 -0.01
Caseworker education: tertiary track -0.19 0.19 -0.11 -0.17 0.00
Caseworker age -0.06 0.04 0.12 0.08 0.17
Mother tongue in canton's language -0.03 0.11 -0.05 0.02 0.16
Caseworker has own unemployment experience -0.14 0.16 0.09 -0.00 0.01
Caseworker education: vocational degree -0.11 0.14 -0.15 0.08 -0.15
Allocation of unemployed to caseworkers: other 0.06 -0.04 -0.14 0.01 -0.07
Caseworker education: above vocational training 0.12 -0.13 0.04 0.12 0.02
Allocation of unemployed to caseworkers: by age -0.02 0.01 -0.08 0.07 -0.00
Lives in medium city -0.08 0.03 0.06 0.05 0.06
Previous job: self-employed 0.01 0.01 0.04 0.03 -0.07
Missing caseworker characteristics 0.07 -0.06 0.04 -0.02 0.01
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective propensity score distribution. Variables are ordered according to the largest absolute difference.
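The classification analysis underlying these tables is mechanical: normalize the covariates, compare their means in the fifth and first quintile of a predicted nuisance parameter, and sort by absolute difference. A minimal sketch with simulated data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)


def classification_analysis(X, score, names):
    """Difference in normalized covariate means, 5th vs. 1st quintile of `score`.

    Returns (name, difference) pairs sorted by the largest absolute difference,
    mirroring the layout of the tables in this appendix.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize: mean 0, variance 1
    q20, q80 = np.quantile(score, [0.2, 0.8])
    diff = Z[score >= q80].mean(axis=0) - Z[score <= q20].mean(axis=0)
    order = np.argsort(-np.abs(diff))
    return [(names[j], float(diff[j])) for j in order]


# Toy example: the first covariate strongly drives the score, the second not at all.
n = 10_000
X = rng.normal(size=(n, 2))
score = 1 / (1 + np.exp(-2 * X[:, 0]))            # stand-in "propensity score"
ranking = classification_analysis(X, score, ["driver", "noise"])
```

The driving covariate shows a large positive quintile difference, while the irrelevant one stays close to zero, which is exactly the pattern used to read the tables above.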
Table C.3: Classification analysis of outcome predictions

No program Job search Vocational Computer Language
(1) (2) (3) (4) (5)
Mother tongue other than German, French, Italian -1.44 -1.66 -1.15 -2.05 -1.88
Swiss citizen 1.36 1.58 1.13 1.97 1.84
Qualification: some degree 1.79 1.97 1.61 1.67 1.87
Previous job: unskilled worker -1.72 -1.75 -1.57 -1.67 -1.91
Past income 1.50 1.39 1.18 0.85 1.72
Qualification: unskilled -1.42 -1.57 -1.56 -1.41 -1.54
Previous job: skilled worker 1.23 1.26 1.21 1.37 1.40
Foreigner with permanent permit -0.87 -1.10 -0.71 -1.36 -1.17
Number of unemployment spells last 2 years -0.92 -0.80 -1.22 -0.59 -0.54
Married -1.00 -1.14 -0.60 -1.16 -1.03
Foreigner with temporary permit -0.83 -0.87 -0.71 -1.11 -1.16
Fraction of months employed last 2 years 0.99 0.79 0.93 0.65 0.87
Employability 0.94 0.83 0.45 0.50 0.61
Cantonal unemployment rate (in %) -0.10 -0.05 -0.81 -0.04 -0.04
Cantonal GDP p.c. -0.04 0.01 -0.79 0.04 0.06
Age -0.72 -0.77 -0.24 -0.32 -0.26
Lives in big city -0.24 -0.23 -0.73 -0.37 -0.28
Missing sector -0.50 -0.48 -0.63 -0.53 -0.71
Number of employment spells last 5 years -0.53 -0.42 -0.71 -0.46 -0.40
Qualification: semiskilled -0.62 -0.67 -0.27 -0.47 -0.59
Lives in no city 0.26 0.22 0.66 0.43 0.32
Previous job: manager 0.55 0.53 0.44 0.30 0.66
Female -0.42 -0.33 -0.48 0.30 -0.59
Previous job in tertiary sector 0.32 0.37 0.17 0.54 0.51
Mother tongue in canton's language -0.22 -0.24 -0.18 -0.44 -0.32
Qualification: skilled without degree -0.35 -0.38 -0.22 -0.34 -0.35
Allocation of unemployed to caseworkers: by occupation 0.17 0.21 0.30 0.38 0.24
Previous job in secondary sector 0.07 0.05 0.29 -0.04 0.08
Caseworker age 0.01 0.03 0.27 -0.13 0.06
Caseworker female -0.00 0.02 -0.18 0.24 -0.02
Previous job in primary sector 0.06 -0.04 0.22 -0.17 -0.01
Caseworker tenure -0.04 -0.06 -0.10 -0.21 -0.06
Lives in medium city -0.09 -0.04 -0.06 -0.17 -0.12
Allocation of unemployed to caseworkers: by employability 0.05 0.03 0.16 0.05 0.04
Caseworker education: vocational degree 0.12 0.09 0.15 0.08 0.10
Caseworker education: above vocational training 0.02 0.04 0.10 0.13 0.05
Allocation of unemployed to caseworkers: by industry 0.12 0.12 0.10 0.11 0.09
Allocation of unemployed to caseworkers: by region 0.05 -0.03 0.11 -0.03 -0.03
Missing caseworker characteristics -0.04 -0.05 -0.10 0.02 -0.05
Caseworker cooperative -0.03 -0.04 -0.08 0.03 -0.04
Allocation of unemployed to caseworkers: by age -0.01 -0.00 0.03 0.07 0.03
Caseworker education: tertiary track 0.05 0.04 0.01 -0.06 0.00
Caseworker has own unemployment experience 0.03 0.05 0.00 0.01 0.02
Previous job: self-employed -0.03 -0.01 -0.03 -0.02 0.02
Allocation of unemployed to caseworkers: other -0.03 -0.03 0.00 -0.01 0.01
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective outcome prediction distribution. Variables are ordered according to the largest absolute difference.
Table C.4: Summary statistics of propensity score distributions
No program Job search Vocational Computer Language
Mean 0.763 0.185 0.014 0.015 0.024
SD 0.093 0.094 0.006 0.007 0.030
Minimum 0.328 0.028 0.005 0.003 0.005
Q1 0.435 0.044 0.007 0.004 0.007
Q25 0.727 0.121 0.009 0.009 0.010
Q50 0.778 0.172 0.012 0.013 0.015
Q75 0.822 0.223 0.017 0.018 0.025
Q99 0.910 0.517 0.031 0.038 0.110
Maximum 0.935 0.619 0.061 0.098 0.487
Note:
The table provides summary statistics of the program specific propensity score distributions. The rows denoted by Q show the respective quantiles.
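The overlap diagnostics discussed in this appendix follow directly from the estimated propensity scores: the smallest score in a treatment group and the share of the group's total inverse probability weight that this single observation receives. A minimal sketch with illustrative numbers (not the application's estimates):

```python
import numpy as np


def min_ps_weight_share(ps_treated):
    """Smallest propensity score among participants and its IPW weight share.

    `ps_treated` holds the estimated propensity scores of the participants of
    one program; each participant receives weight 1 / e_w(x).
    """
    ps = np.asarray(ps_treated, dtype=float)
    w = 1.0 / ps
    return float(ps.min()), float(w.max() / w.sum())


# Hypothetical group: one participant with score 0.003 among 199 with score 0.03.
ps = np.array([0.003] + [0.03] * 199)
smallest, share = min_ps_weight_share(ps)
```

Even though the most extreme observation gets a weight of roughly 333, its share of the group's total weight stays modest, which is the point made in the text: the smallest propensity score alone is not informative about overlap when treatment groups are imbalanced.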
Figure C.1: Propensity score distributions

[Density plots of the estimated propensity scores; panels: (a) No program, (b) Job search, (c) Vocational, (d) Computer, (e) Language.]

C.3 Average treatment effects

Figure C.2 shows the APO estimates and Figure C.3 shows the effects of program participation on the employment probabilities over time. The latter documents that all programs show the well-known lock-in effect within the first months after program start (e.g. Wunsch, 2016). However, participants of the hard skill programs catch up and show a sustained increase in employment rates of up to 10 percentage points.

Figure C.2: Average potential outcomes

[Point estimates of the average potential outcomes of No program, Job search, Vocational, Computer and Language.]

Notes:
Average potential outcomes with 95% confidence intervals. Numeric results in Panel A of Table C.5.

Figure C.3: Effects of program participation on employment probabilities over time

[Monthly average treatment effects over the 31 months after program start; panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]

Notes:
Solid lines show the ATE, dashed lines the ATET of the respective programs compared to nonparticipation on the employment probability in the 31 months after program start. Grey area depicts the 95% confidence intervals.
Table C.5: Average effects

Estimate Standard error
(1) (2)

Panel A: APO
No program 14.80 0.06
Job search 13.78 0.12
Vocational 18.02 0.42
Computer 18.16 0.43
Language 17.36 0.37

Panel B: ATE
Job search - no program -1.02 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗

Panel C: ATET
Job search - no program -0.98 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗

Note: Table shows DML based point estimates and standard errors of average effects. ∗ p < ∗∗ p < ∗∗∗ p <

C.4 Kernel regression CATEs

Figure C.4 shows that the kernel regression CATEs detect no substantial heterogeneity across individuals of different age. Either the cross-validated bandwidth is very large, estimating essentially a constant effect for job search, vocational training and computer courses, or the bandwidth seems too small, leading to imprecise and erratic estimates around the mean effect for language programs.

Figure C.4: Effect heterogeneity regarding age
[Conditional average treatment effects by age (30–50); panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]

Dotted line indicates the point estimate of the respective average treatment effect. Grey area shows the 95%-confidence interval.

C.5 IATEs

Figure C.5 documents the extreme IATE predictions obtained using the full sample. Especially the DR-learner produces very extreme estimates for vocational training, ranging from -209 to 165. Also in this extreme case, the NDR-learner mitigates the problem substantially. However, Table C.6 documents that it still produces implausibly high values, ranging from -21 to 23. The out-of-sample prediction of IATEs is therefore preferred and discussed in the main text.

Figure C.5: Boxplot of IATEs estimated by DR- and NDR-learner

[Boxplots of the IATE distributions per program for DRL.oos, NDRL.oos, DRL.full and NDRL.full.]
Note:
The figure shows the distribution of IATEs for participating in the program labeled on the x-axis vs. non-participation, estimated by the DR-learner (DRL) and the NDR-learner (NDRL). The first two boxplots of a group are obtained using the out-of-sample (oos) procedure of Appendix B and the other two from the full sample. The dashed line indicates the possible range of the IATE of [-31, 31] to illustrate that several DR-learner estimated IATEs lie outside this bound.
Table C.6 and Figure C.6 provide a detailed comparison of the IATEs estimated by the DR- and NDR-learner. We see that the differences are mainly driven by a few outliers, as indicated by the much larger kurtosis of the DR-learner IATEs. However, most of the estimates are quite similar, as the correlations of at least 0.88 provided in the last row of Table C.6 and the scatter plots in Figure C.6 document.

Table C.6: Summary statistics of IATE distributions
Job search Vocational Computer Language
DRL NDRL DRL NDRL DRL NDRL DRL NDRL

Panel A: Out-of-sample
Mean -0.98 -1.00 3.22 3.17 3.55 3.47 2.35 2.36
SD 0.97 0.92 1.96 1.71 2.65 2.18 2.05 1.78
Minimum -8.91 -5.11 -14.54 -6.41 -58.73 -6.06 -14.80 -5.30
Q1 -3.16 -3.09 -1.88 -1.07 -3.65 -1.82 -2.55 -1.74
Q25 -1.62 -1.62 2.06 2.08 2.08 2.07 1.07 1.10
Q50 -1.02 -1.04 3.24 3.17 3.61 3.45 2.39 2.43
Q75 -0.39 -0.43 4.43 4.27 5.16 4.86 3.61 3.63
Q99 1.53 1.36 7.89 7.25 9.20 8.75 7.49 6.24
Maximum 4.80 5.82 19.39 11.39 29.85 14.16 22.42 8.76
Kurtosis 3.71 3.51 5.44 3.51 21.50 3.34 5.09 2.76
Correlation 0.99 0.93 0.87 0.93

Panel B: Full sample
Mean -1.02 -1.04 3.21 3.12 3.31 3.09 2.64 2.62
SD 1.42 1.31 8.42 3.67 2.32 2.15 1.45 1.42
Minimum -12.25 -9.47 -208.94 -20.75 -38.75 -8.82 -5.77 -3.65
Q1 -4.60 -4.24 -10.21 -6.59 -2.18 -2.29 -0.82 -0.68
Q25 -1.90 -1.88 0.99 0.90 1.92 1.72 1.65 1.61
Q50 -1.05 -1.06 3.32 3.18 3.35 3.14 2.72 2.72
Q75 -0.16 -0.21 5.65 5.48 4.74 4.50 3.66 3.67
Q99 2.55 2.23 16.16 11.49 8.49 8.06 5.77 5.51
Maximum 12.40 8.34
Note:
The table provides summary statistics for the distributions of IATEs estimated by the DR-learner (DRL) and NDR-learner (NDRL). The rows denoted by Q show the respective quantiles. Correlation is calculated between the DR-learner and the NDR-learner. Bold numbers indicate values that are outside the possible range.

Figure C.6: Joint and marginal distributions of IATEs

[Scatter plots of DR-learner vs. NDR-learner IATEs; panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]
Notes:
Figures show the joint and marginal distributions of IATEs estimated by the DR-learner and the NDR-learner.
Table C.7: Classification analysis of IATEs

Job search Vocational Computer Language
(1) (2) (3) (4)
Previous job: unskilled worker 0.97 0.73 0.39 -1.38
Past income -1.33 -0.92 -1.13 1.03
Mother tongue other than German, French, Italian 0.65 0.71 0.05 -1.28
Qualification: some degree -0.85 -0.68 -0.44 1.28
Swiss citizen -0.61 -0.68 0.05 1.27
Qualification: unskilled 0.77 0.47 0.32 -1.16
Fraction of months employed last 2 years -1.02 -0.42 -0.44 0.35
Previous job: skilled worker -0.76 -0.47 -0.17 1.02
Foreigner with permanent permit 0.32 0.47 -0.16 -0.82
Female 0.54 0.09 0.79 -0.51
Foreigner with temporary permit 0.47 0.38 0.13 -0.78
Married 0.32 0.65 0.20 -0.76
Missing sector 0.67 0.07 0.17 -0.57
Cantonal GDP p.c. 0.26 -0.66 -0.02 0.25
Cantonal unemployment rate (in %) 0.34 -0.61 0.02 0.11
Previous job in tertiary sector -0.37 -0.31 0.04 0.60
Employability -0.50 -0.59 -0.57 0.26
Number of employment spells last 5 years 0.46 0.15 0.04 -0.06
Number of unemployment spells last 2 years 0.46 0.09 0.21 -0.11
Previous job: manager -0.33 -0.35 -0.31 0.44
Age 0.05 0.42 0.43 -0.00
Lives in big city 0.23 -0.38 -0.09 -0.10
Caseworker age 0.13 0.37 -0.01 0.05
Lives in no city -0.34 0.24 0.08 0.10
Qualification: semiskilled 0.20 0.34 0.20 -0.29
Allocation of unemployed to caseworkers: by region -0.30 0.08 -0.04 -0.13
Allocation of unemployed to caseworkers: by occupation 0.12 0.03 0.24 0.30
Previous job in primary sector -0.27 0.25 -0.20 -0.16
Caseworker female 0.05 -0.25 0.25 -0.05
Qualification: skilled without degree 0.13 0.09 0.04 -0.22
Caseworker education: above vocational training 0.03 0.13 0.11 0.20
Lives in medium city 0.20 0.12 -0.00 -0.02
Mother tongue in canton's language 0.05 0.09 -0.18 -0.11
Previous job in secondary sector -0.01 0.17 -0.09 -0.09
Allocation of unemployed to caseworkers: by employability -0.16 0.03 0.03 0.06
Allocation of unemployed to caseworkers: by industry 0.07 -0.14 -0.01 0.02
Caseworker education: tertiary track -0.02 -0.14 -0.13 -0.14
Caseworker education: vocational degree -0.14 0.03 -0.13 -0.02
Caseworker cooperative 0.03 -0.12 0.13 -0.07
Allocation of unemployed to caseworkers: by age -0.03 0.01 0.10 0.02
Missing caseworker characteristics 0.03 -0.10 0.09 -0.09
Caseworker tenure 0.06 0.05 -0.07 -0.09
Caseworker has own unemployment experience 0.05 -0.05 -0.06 0.06
Previous job: self-employed 0.05 0.02 0.03 0.04
Allocation of unemployed to caseworkers: other -0.03 0.02 -0.03 0.00
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective estimated IATE distribution. Variables are ordered according to the largest absolute difference.

C.6 Optimal treatment assignment

Figure C.7: Optimal decision tree of depth three with five covariates
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

Figure C.8: Overlap of cross-validated policy rules with five covariates
[Histogram of the overlap for trees of depth 1, 2 and 3; x-axis: overlap, y-axis: fraction.]
Notes:
Figure shows the fraction of cross-validated policies that agree with the full sample policy.

Figure C.9: Optimal decision tree of depth three with 16 covariates
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

Figure C.10: Overlap of cross-validated policy rules with 16 covariates
[Histogram of the overlap for trees of depth 1, 2 and 3; x-axis: overlap, y-axis: fraction.]
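The depth-limited trees in this section maximize the sum of doubly robust scores over all trees of a given depth (Zhou et al., 2018), as implemented in the policytree package. The core of the search can be illustrated for depth one by an exhaustive scan over covariates and split points; the following numpy sketch is a simplified stand-in for the actual implementation, with illustrative names and simulated scores:

```python
import numpy as np

rng = np.random.default_rng(2)


def depth1_policy_tree(X, scores):
    """Exhaustive depth-1 tree maximizing the sum of doubly robust scores.

    `scores[i, w]` is the DR score of observation i under treatment w.
    Returns (feature, threshold, action_left, action_right).
    """
    best_val, best_rule = -np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            # Best single action on each side of the candidate split.
            left_sums = scores[left].sum(axis=0)
            right_sums = scores[~left].sum(axis=0)
            val = left_sums.max() + right_sums.max()
            if val > best_val:
                best_val = val
                best_rule = (j, float(t),
                             int(left_sums.argmax()), int(right_sums.argmax()))
    return best_rule


# Toy data: treatment 1 pays off only for x0 > 0.5, treatment 0 otherwise.
n = 2_000
X = rng.uniform(size=(n, 2)).round(2)        # rounding keeps the split grid small
scores = np.zeros((n, 2))
scores[:, 1] = np.where(X[:, 0] > 0.5, 1.0, -1.0) + rng.normal(scale=0.3, size=n)

rule = depth1_policy_tree(X, scores)
```

The search recovers the split on the relevant covariate and assigns the welfare-maximizing action on each side; deeper trees apply the same principle recursively, which is what makes the exhaustive search expensive and motivates the depth limits used above.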