Double Machine Learning based Program Evaluation under Unconfoundedness
Michael C. Knaus†
First version: March 9, 2020. This version: October 20, 2020
Abstract
This paper reviews, applies and extends recently proposed methods based on Double Machine Learning (DML) with a focus on program evaluation under unconfoundedness. DML based methods leverage flexible prediction models to adjust for confounding variables in the estimation of (i) standard average effects, (ii) different forms of heterogeneous effects, and (iii) optimal treatment assignment rules. An evaluation of multiple programs of the Swiss Active Labor Market Policy illustrates how DML based methods enable a comprehensive program evaluation. Motivated by extreme individualized treatment effect estimates of the DR-learner method, we propose the normalized DR-learner to address this issue.
Keywords:
Causal machine learning, conditional average treatment effects, optimal policy learning, individualized treatment rules, multiple treatments, DR-learner
JEL classification:
C21

∗ Financial support from the Swiss National Science Foundation (SNSF) is gratefully acknowledged. The study is part of the project "Causal Analysis with Big Data" (grant number SNSF 407540_166999) of the Swiss National Research Program "Big Data" (NRP 75). I thank Petyo Bonev, Martin Huber, Edward Kennedy, Michael Lechner, Vira Semenova, Anthony Strittmatter, Stefan Wager, and Michael Zimmert for helpful comments and suggestions. The usual disclaimer applies.
† University of St. Gallen. Michael C. Knaus is also affiliated with IZA, Bonn, [email protected].

1 Introduction
The adaptation of so-called machine learning to causal inference has been a productive area of methodological research in recent years. The resulting new methods complement the existing econometric toolbox for program evaluation along at least two dimensions (see for recent overviews Athey & Imbens, 2017, 2019; Abadie & Cattaneo, 2018). On the one hand, they provide flexible methods to estimate standard average effects. In particular, they provide a data-driven approach to variable and model selection in studies that rely on an unconfoundedness assumption for identification. On the other hand, they enable a more comprehensive evaluation by providing new methods for the flexible estimation of heterogeneous effects and of optimal treatment assignment rules.

This paper considers Double Machine Learning (DML) as a framework for a flexible and comprehensive program evaluation. The DML framework seems attractive because (i) it can be combined with a variety of standard supervised machine learning methods, (ii) it covers average effects for binary (e.g. Belloni, Chernozhukov, & Hansen, 2014; Belloni, Chernozhukov, Fernández-Val, & Hansen, 2017; Chernozhukov, Chetverikov, et al., 2018), multiple (e.g. Farrell, 2015) as well as continuous treatments (e.g. Kennedy, Ma, McHugh, & Small, 2017; Colangelo & Lee, 2019; Semenova & Chernozhukov, 2020), (iii) it naturally extends to the estimation of heterogeneous treatment effects of different forms like canonical subgroup effects, the best linear prediction of effect heterogeneity (Semenova & Chernozhukov, 2020), or nonparametric effect heterogeneity (e.g. Fan, Hsu, Lieli, & Zhang, 2019; Zimmert & Lechner, 2019; Foster & Syrgkanis, 2019; Oprescu, Syrgkanis, & Wu, 2019; Kennedy, 2020), and (iv) it can be used to estimate optimal treatment assignment rules (e.g. Dudik, Langford, & Li, 2011; Athey & Wager, 2017; Zhou, Athey, & Wager, 2018).
All these DML based methods have favorable statistical properties and allow the use of standard tools like t-tests, OLS, kernel regression or supervised machine learning for estimating causal parameters of interest after flexibly controlling for confounding.

This study starts with a review of DML based methods, then applies these methods in a standard labor economic setting, and comes back to the methods to propose a fix for a finite sample problem that occurred in the application. (Unconfoundedness is also known as exogeneity, selection on observables, ignorability, or the conditional independence assumption.) Thus, it contributes to the
steadily growing literature of causal machine learning for program evaluation in three ways. First, the review highlights that the methods for the different parameters all build on the same doubly robust score. The construction of this score might be computationally expensive because it requires the estimation of outcomes and treatment probabilities via machine learning methods. However, once constructed, the score can be reused for a variety of additional parameters of interest (see Figure 1 for a summary).

Figure 1: The doubly robust score as central building block. Predicted outcomes and predicted treatment probabilities form the doubly robust score, which feeds the estimation of average effects, heterogeneous effects (subgroup, best linear prediction, nonparametric) and optimal treatment assignment rules.

This makes DML based methods particularly attractive for researchers who want to avoid using different frameworks for different parameters, as the set of methods keeps growing that integrate machine learning in the estimation of average treatment effects (e.g. van der Laan & Rubin, 2006; Athey, Imbens, & Wager, 2018; Avagyan & Vansteelandt, 2017; Tan, 2018; Ning, Peng, & Imai, 2018), heterogeneous treatment effects (e.g. Tian, Alizadeh, Gentles, & Tibshirani, 2014; Athey & Imbens, 2016; Wager & Athey, 2018; Athey, Tibshirani, & Wager, 2019; Künzel, Sekhon, Bickel, & Yu, 2019) and optimal treatment assignment (e.g. Bansak et al., 2018; Kallus, 2018).

Second, we use DML based methods to provide a comprehensive and computationally convenient evaluation of four programs of the Swiss Active Labour Market Policy (ALMP) in a standard dataset (Huber, Lechner, & Mellace, 2017). The evaluation in this paper illustrates the potential of DML based methods for program evaluations under unconfoundedness and provides a potential blueprint for similar analyses. This adds to a small but steadily growing literature that applies causal machine learning to program evaluation in general (e.g.
Bertrand, Crépon, Marguerie, & Premand, 2017; Davis & Heller, 2017; Strittmatter, 2018; Farbmacher, Heinrich, & Spindler, 2019; Gulyas & Pytka, 2019; Knittel, 2019) and to evaluations based on unconfoundedness in particular (e.g. Knaus, 2018; Jacob, Härdle, & Lessmann, 2019; Kreif & DiazOrdaz, 2019; Cockx, Lechner, & Bollens, 2020; Knaus, Lechner, & Strittmatter, 2020a).

Third, we contribute to the methodological literature on the flexible estimation of individualized treatment effects (see for a recent overview Knaus, Lechner, & Strittmatter, 2020b) by proposing the normalized DR-learner (NDR-learner), which builds on the recent DR-learner of Kennedy (2020). The application reveals that the plain DR-learner produces a few extreme effect estimates. However, a normalization similar to the popular Hájek (1971) normalization for inverse probability weighting is shown to stabilize the estimates. The increased stability comes at the price that the NDR-learner limits the class of applicable machine learning methods to linear smoothers (e.g. Random Forests, Ridge or Post-Lasso).

Overall, we find that DML based methods provide a promising set of methods for program evaluation. The estimated average program effects are in line with the previous literature. We find that computer, vocational and language courses increase employment in the 31 months after program start, while the effects of job search trainings are mostly negative. The heterogeneity analysis reveals substantial heterogeneities by gender, nationality, previous labor market success and qualification. These are picked up by the estimated optimal assignment rules.

The paper proceeds as follows. Section 2 defines the estimands of interest and their identification under unconfoundedness. Section 3 reviews DML based methods for estimation and introduces the NDR-learner. Section 4 presents the application. Section 5 describes the implementation of the methods. Section 6 reports the results. Section 7 concludes.
Appendices A to C provide additional explanations and results. The R-package causalDML implements the applied estimators. A notebook replicates the main results.
2 Estimands of interest and their identification

2.1 Estimands of interest

We define the estimands of interest in the multiple treatment version of the potential outcomes framework (Rubin, 1974; Imbens, 2000; Lechner, 2001). Let W = {0, 1, ..., T} denote a set of programs and D_i(w) = 1(W_i = w) a binary variable indicating in which program individual i (i = 1, ..., N) is actually observed. We assume that each individual has a potential outcome Y_i(w) for all w ∈ W. Without loss of generality, the discussion below assumes that higher outcome values are desirable.

The first estimand of interest is the average potential outcome (APO), γ_w = E[Y_i(w)]. It answers the question about the average outcome if the whole population was assigned to program w. However, the more interesting question is usually to compare different programs w and w'. To this end, we take the difference of the according individual potential outcomes, Y_i(w) − Y_i(w'), and aggregate them to different estimands: First, the average treatment effect (ATE), δ_{w,w'} = E[Y_i(w) − Y_i(w')]. Second, the average treatment effect on the treated (ATET), θ_{w,w'} = E[Y_i(w) − Y_i(w') | W_i = w]. Third, the conditional average treatment effect (CATE), τ_{w,w'}(z) = E[Y_i(w) − Y_i(w') | Z_i = z], where Z_i ∈ Z is a vector of observed pre-treatment variables.

The different aggregations accommodate the notion that treatment effects might be heterogeneous. The ATE represents the average effect in the population, while the ATET shows it for the subpopulation that is actually observed in program w. Thus, the comparison of ATE and ATET can be informative about the quality of the program assignment mechanism. For example, an ATET larger than the ATE indicates that the observed program assignment is better than random.

The ATET is defined by the observed program assignment and is thus not subject to the choice of the researcher.
In contrast, the conditioning variables Z_i of the CATE are specified by the researcher to investigate potentially heterogeneous effects across the groups of individuals that are defined by different values of Z_i. Such heterogeneous effects can be indicative of underlying mechanisms. Further, CATEs characterize which groups win and which lose, and by how much, by receiving program w instead of w'. (In the canonical binary treatment setting the individual effect would be Y_i(1) − Y_i(0). We focus in this study on expectations of the individual treatment effects; DML based methods for quantile treatment effects can be found, e.g., in Belloni et al. (2017) and Kallus, Mao, and Uehara (2019). For DML based estimation with continuous treatments see, e.g., Kennedy et al. (2017), Colangelo and Lee (2019) and Semenova and Chernozhukov (2020).)

The different average effects above provide a comprehensive evaluation of programs under the current program assignment policy. In many applications, however, we want to conclude the analysis with a recommendation how the assignment policy could be improved. Let π(Z_i) be a policy that assigns individuals to programs according to their characteristics Z_i or, put more formally, a function that maps observable characteristics to a program: π : Z → W. In principle, the policy rule can be completely flexible and in the ideal world we would assign each individual to the program with the highest conditional APO, E[Y_i(w) | Z_i = z]. However, in many cases we want to restrict the set of candidate policy rules, denoted by Π, to be interpretable for the communication with decision makers or to incorporate cost or fairness constraints. Each of these candidate policy rules has a policy value function denoted by Q(π) = E[Y_i(π(Z_i))] = E[Σ_w 1(π(Z_i) = w) Y_i(w)]. Q(π) quantifies the average population outcome if policy rule π would be used to assign programs.
The estimand of interest is then the optimal policy rule π* with the highest value function among the set of candidate policy rules, or formally π* = argmax_{π ∈ Π} Q(π).

2.2 Identification under unconfoundedness

The previous section defined the estimands of interest in terms of potential outcomes. However, each individual is only observed in one program. Thus, only one potential outcome per individual is observable and the other potential outcomes remain latent. This is the fundamental problem of causal inference (Holland, 1986) and we need further assumptions to identify the estimands of interest. In this paper, we consider the unconfoundedness assumption that assumes access to a vector of pre-treatment variables X_i ∈ X containing Z_i such that the following standard assumptions hold (e.g. Imbens & Rubin, 2015):

Assumption 1
(a) Unconfoundedness: Y_i(w) ⊥⊥ W_i | X_i = x, ∀ w ∈ W and x ∈ X.
(b) Common support: 0 < P[W_i = w | X_i = x] ≡ e_w(x), ∀ w ∈ W and x ∈ X.
(c) Stable Unit Treatment Value Assumption (SUTVA): Y_i = Y_i(W_i).

The unconfoundedness assumption requires that X_i contains all confounding variables that jointly affect program assignment and the outcome. Common support states that it must be possible to observe each individual in all programs. SUTVA rules out interference.

These assumptions allow the identification of the average potential outcome (APO) conditional on confounders in three common ways:

E[Y_i(w) | X_i = x] = E[Y_i | W_i = w, X_i = x] ≡ μ(w, x)                          (1)
                    = E[ D_i(w) Y_i / e_w(x) | X_i = x ]                            (2)
                    = E[ μ(w, x) + D_i(w)(Y_i − μ(w, x)) / e_w(x) | X_i = x ],      (3)

where the term inside the expectation of Equation 3 is denoted by Γ(w, x) ≡ μ(w, x) + D_i(w)(Y_i − μ(w, x)) / e_w(x). Equation 1 shows that the conditional APO is identified as a conditional expectation of the observed outcome. Equation 2 shows that it is identified by reweighting the observed outcome with the inverse treatment probability.
Finally, Equation 3 adds the reweighted outcome residual to the conditional outcome representation of Equation 1. This seems redundant because we can check that the reweighted residual has expectation zero under unconfoundedness. However, this identification result is doubly robust in the sense that it still holds if we replace either μ(w, x) or e_w(x) in Equation 3 by arbitrary functions of x. This doubly robust structure plays a crucial role for the estimation procedures that we discuss in the next section. (Appendix A reviews the identification and double robustness of Equation 3 for completeness.)

From an identification perspective, Γ(w, x) defined in Equation 3 suffices to identify all estimands of interest stated in the previous subsection:

• APO: γ_w = E[Y_i(w)] = E[Γ(w, X_i)]
• ATE: δ_{w,w'} = E[Y_i(w) − Y_i(w')] = E[Γ(w, X_i) − Γ(w', X_i)]
• ATET: θ_{w,w'} = E[Y_i(w) − Y_i(w') | W_i = w] = E[Γ(w, X_i) − Γ(w', X_i) | W_i = w]
• CATE: τ_{w,w'}(z) = E[Y_i(w) − Y_i(w') | Z_i = z] = E[Γ(w, X_i) − Γ(w', X_i) | Z_i = z]
• Policy value: Q(π) = E[Y_i(π(Z_i))] = E[Σ_w 1(π(Z_i) = w) Γ(w, X_i)]
• Optimal policy: π* = argmax_{π ∈ Π} Q(π) = argmax_{π ∈ Π} E[Σ_w 1(π(Z_i) = w) Γ(w, X_i)]

3 Estimation

All Double Machine Learning (DML) based estimators for the estimands of interest discussed in the following build on the doubly robust scores of Robins, Rotnitzky, and Zhao (1994, 1995). In the following, capital Greek letters denote the scores corresponding to the small Greek letters used to define the estimands in Section 2.1.

The construction of the doubly robust scores requires the input of so-called nuisance parameters that are usually of secondary interest and considered as a tool to eventually obtain the parameters of interest. In our case, the two nuisance parameters are μ(w, x) = E[Y_i | W_i = w, X_i = x] and e_w(x) = P[W_i = w | X_i = x] for all w.
μ(w, x) is the conditional outcome mean for the subgroup observed in program w. e_w(x) is the conditional probability to be observed in program w, also known as the propensity score. Usually these functions are unknown and need to be estimated. Following Chernozhukov, Chetverikov, et al. (2018), they are estimated based on K-fold cross-fitting: (i) randomly divide the sample into K folds of similar size, (ii) leave out fold k and estimate models for the nuisance parameters in the remaining K − 1 folds, (iii) predict the nuisance parameters μ̂^{−k}(w, x) and ê_w^{−k}(x) in the left-out fold k, and (iv) repeat (ii) and (iii) such that each fold is left out once. This procedure avoids overfitting in the sense that no observation is used to predict its own nuisance parameters. To avoid notational clutter, we ignore the dependence on the specific fold in the following notation and refer to the cross-fitted nuisance parameters as μ̂(w, x) and ê_w(x).

The main building block of the following estimators is the doubly robust score of the APO, which replaces the true nuisance parameters in Equation 3 by their cross-fitted predictions:

Γ̂_{i,w} = μ̂(w, X_i) + D_i(w)(Y_i − μ̂(w, X_i)) / ê_w(X_i).   (4)

The ATE score for the comparison of treatments w and w' is then constructed as the difference of the respective APO scores:

Δ̂_{i,w,w'} = Γ̂_{i,w} − Γ̂_{i,w'}.   (5)

The only estimator we consider that uses the same nuisance parameters but plugs them into a different score is the ATET estimator. Although the identification result with the doubly robust APO score in the previous section holds, it is not doubly robust. However, the doubly robust score for the ATET exists and is defined as

Θ̂_{i,w,w'} = D_i(w)(Y_i − μ̂(w', X_i)) / ê_w − D_i(w') ê_w(X_i)(Y_i − μ̂(w', X_i)) / (ê_w ê_{w'}(X_i)),   (6)

where ê_w = N_w / N is the unconditional treatment probability with N_w counting the number of individuals observed in program w (see also, e.g., Farrell, 2015).
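To make the construction of the cross-fitted doubly robust score concrete, the following minimal Python sketch computes the APO scores of Equation 4 for a binary treatment. It is illustrative only (the paper's own implementation is the R package causalDML): as a stand-in for the machine learning step, the nuisance parameters are estimated by cell means and cell shares within discrete covariate cells of a single covariate x.

```python
import random

def dr_score(y_i, d_i, mu_hat, e_hat):
    """Doubly robust APO score for one observation (Equation 4)."""
    return mu_hat + d_i * (y_i - mu_hat) / e_hat

def cross_fit_dr_scores(y, w, x, K=2, seed=None):
    """Cross-fitted doubly robust APO scores for a binary treatment.
    Stand-in 'ML': mu_hat(arm, x) is the mean outcome and e_hat_arm(x)
    the treatment share within the covariate cell of x, both estimated
    on the folds that leave observation i out."""
    n = len(y)
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    folds = [idx[j::K] for j in range(K)]
    gamma = {0: [0.0] * n, 1: [0.0] * n}
    for k in range(K):
        held_out = set(folds[k])
        train = [i for i in range(n) if i not in held_out]
        for i in held_out:
            cell = [j for j in train if x[j] == x[i]]   # same covariate cell
            for arm in (0, 1):
                arm_cell = [j for j in cell if w[j] == arm]
                mu_hat = sum(y[j] for j in arm_cell) / len(arm_cell)
                e_hat = len(arm_cell) / len(cell)
                d_i = 1 if w[i] == arm else 0
                gamma[arm][i] = dr_score(y[i], d_i, mu_hat, e_hat)
    return gamma
```

The mean of `gamma[1][i] - gamma[0][i]` over the sample is then the ATE estimate based on the score of Equation 5.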
The estimation of the APOs, ATEs and ATETs boils down to taking the means of the previously defined doubly robust scores. For statistical inference, we can rely on standard one-sample t-tests. Thus, the score's mean and the variance of this mean are the point and the variance estimate of the respective estimand of interest:

• APO: γ̂_w = N^{−1} Σ_i Γ̂_{i,w} and σ̂²_{γ̂_w} = N^{−2} Σ_i (Γ̂_{i,w} − γ̂_w)²
• ATE: δ̂_{w,w'} = N^{−1} Σ_i Δ̂_{i,w,w'} and σ̂²_{δ̂_{w,w'}} = N^{−2} Σ_i (Δ̂_{i,w,w'} − δ̂_{w,w'})²
• ATET: θ̂_{w,w'} = N^{−1} Σ_i Θ̂_{i,w,w'} and σ̂²_{θ̂_{w,w'}} = N^{−2} Σ_i (Θ̂_{i,w,w'} − θ̂_{w,w'})²

Note that the estimated variances require no adjustment for the fact that we have estimated the nuisance parameters in a first step. The resulting estimators are consistent, asymptotically normal and semiparametrically efficient under the main assumption that the estimators of the cross-fitted nuisance parameters are consistent and converge sufficiently fast (Belloni et al., 2014; Farrell, 2015; Belloni et al., 2017; Chernozhukov, Chetverikov, et al., 2018). In particular, the product of the convergence rates of the outcome and propensity score estimators must be at least n^{1/2}. This allows the application of machine learning to estimate the nuisance parameters. Flexible machine learning estimators usually converge slower than the parametric rate n^{1/2}, but several are known to be able to achieve n^{1/4}, which would be sufficiently fast if both nuisance parameter estimators achieve it.

It is well known that estimators using doubly robust scores and parametric models for the nuisance parameters are doubly robust in the sense that they remain consistent if one of the parametric models is misspecified (see, e.g. Glynn & Quinn, 2009). The difference of the DML version is that it exploits what Smucler, Rotnitzky, and Robins (2019) call 'rate double robustness'.
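The mean-and-t-test logic above is a few lines of code. The following sketch (illustrative, not the paper's causalDML implementation; it uses the sample variance with the conventional N − 1 denominator) returns the point estimate and standard error for any vector of doubly robust scores:

```python
from math import sqrt
from statistics import mean

def score_mean_inference(scores):
    """Point estimate and standard error for the mean of doubly robust
    scores, as used for APO/ATE/ATET estimation (one-sample t-test)."""
    n = len(scores)
    est = mean(scores)
    var_hat = sum((s - est) ** 2 for s in scores) / (n - 1)  # sample variance
    se = sqrt(var_hat / n)                                   # s.e. of the mean
    return est, se

def ate_inference(gamma_w, gamma_w0):
    """ATE inference from two vectors of APO scores: take the mean of
    the score differences (Equation 5)."""
    return score_mean_inference([a - b for a, b in zip(gamma_w, gamma_w0)])
```

The same function applies unchanged to ATET scores, which is exactly the reuse property emphasized in the text.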
This robustness allows the estimation of the parameters of interest at the parametric rate n^{1/2} even if the nuisance parameters are estimated at slower rates using machine learning methods that do not require the specification of an actual parametric model. (Further results, regularity conditions and discussions can be found in Section 5.1 of Chernozhukov, Chetverikov, et al. (2018). For example, versions of Lasso (Belloni & Chernozhukov, 2013), Boosting (Luo & Spindler, 2016), Random Forests (Wager & Walther, 2015; Syrgkanis & Zampetakis, 2020), Neural Nets (Farrell, Liang, & Misra, 2018), forward model selection (Kozbur, 2020) or ensembles of those can be shown to achieve the required rates under conditions stated in the original papers.)

We can reuse the ATE score of Equation 5 to estimate conditional effects. In the following, we discuss estimators that exploit the fact that the conditional expectation of the score with known nuisance parameters equals the CATE: τ_{w,w'}(z) = E[Δ_{i,w,w'} | Z_i = z]. Thus, a natural way to estimate CATEs is to use the score with estimated nuisance parameters, Δ̂_{i,w,w'}, as pseudo-outcome in standard regression frameworks. (Note that this does not work for the ATET score in Equation 6; suitable adaptations are beyond the scope of this paper.)

We consider two special cases of CATEs following Knaus et al. (2020b). (i) Group average treatment effects (GATEs) provide the average effects for pre-specified, usually low-dimensional groups and are thus equivalent to the standard subgroup analysis comparing, e.g., men and women. (ii) Individualized average treatment effects (IATEs) aim for the most detailed effect heterogeneity that considers all confounders as heterogeneity variables (τ_{w,w'}(x) = E[Y_i(w) − Y_i(w') | X_i = x]). We review recently proposed estimators for these parameters in the following.
Semenova and Chernozhukov (2020) propose to use the pseudo-outcome in an OLS regression and to minimize

β̂_{w,w'} = argmin_β Σ_{i=1}^N ( Δ̂_{i,w,w'} − Z̃_i'β )²,

where Z̃_i contains the original Z_i and a constant. The resulting coefficients β̂_{w,w'} have the same interpretation as in a standard OLS model. The only difference is that instead of linearly modelling the level of an outcome, they model the level of a causal effect. Consequently, the fitted values estimate GATEs if we specify a fully saturated OLS model. Otherwise, the fitted values provide the best linear predictor (BLP) of the CATE. (Formally defined as τ̂^{ols}_{w,w'}(z) = ⟨z̃, β̂_{w,w'}⟩, where ⟨·,·⟩ denotes the inner product.) Most importantly, Semenova and Chernozhukov (2020) show that standard heteroscedasticity robust standard errors are valid and that we can again ignore the fact that the nuisance parameters are estimated and potentially converge slower than n^{1/2}.

A complementary option for few continuous Z_i is proposed by Fan et al. (2019) and Zimmert and Lechner (2019). The pseudo-outcome can also be used in nonparametric kernel regressions (KR):

τ̂^{np}_{w,w'}(z) = Σ_{i=1}^N K_h(Z_i − z) Δ̂_{i,w,w'} / Σ_{i=1}^N K_h(Z_i − z),

where K_h(·) is a suitable kernel function with bandwidth h. Fan et al. (2019) and Zimmert and Lechner (2019) show that, like in the OLS case, the uncertainty of the nuisance parameter estimation can be neglected and standard statistical inference for kernel regression applies. However, there is a price to pay for this flexibility in terms of the required speed of nuisance parameter convergence. For kernel regressions this requirement depends on the dimension of Z_i. For example, the product of the convergence rates needs to be faster than n^{1/2} for a one-dimensional continuous Z_i, and the requirement further increases with more variables.

IATEs may be estimated using the pseudo-outcome in supervised machine learning regressions.
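The BLP regression above amounts to an ordinary OLS fit with the score as dependent variable. A minimal one-regressor Python sketch (illustrative only; the paper's implementation additionally uses heteroscedasticity robust standard errors and allows multiple heterogeneity variables):

```python
from statistics import mean

def blp_cate(delta_hat, z):
    """Best linear predictor of the CATE: OLS of the pseudo-outcome
    delta_hat (the ATE score) on a single heterogeneity variable z
    plus a constant. Returns (intercept, slope)."""
    zbar, dbar = mean(z), mean(delta_hat)
    cov = sum((zi - zbar) * (di - dbar) for zi, di in zip(z, delta_hat))
    var = sum((zi - zbar) ** 2 for zi in z)
    slope = cov / var
    return dbar - slope * zbar, slope
```

For a binary z (e.g. a gender dummy), intercept and intercept-plus-slope are exactly the two GATEs of the saturated model.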
Kennedy (2020) calls the resulting class of estimators DR-learner and shows that the doubly robust structure of the ATE score results in favorable error bounds that would not be attainable by outcome regression or IPW based methods alone. We consider two variants of the DR-learner. First, we follow the logic of the previous two subsections and use the pseudo-outcome of the full sample in one supervised machine learning regression to estimate IATEs in sample.

This full sample procedure is computationally convenient but prone to overfitting. Thus, the second variant aims for out-of-sample IATE predictions for each individual in the sample. Following Algorithm 1 of Kennedy (2020), this requires a different cross-fitting scheme than the one described in Section 3.1: (i) randomly split the sample into four parts, (ii) use the first part to estimate the propensity score model, (iii) use the second part to estimate the outcome regression models, (iv) use the propensity score and outcome models to predict the nuisance parameters in the third part and construct the pseudo-outcome Δ̂_{i,w,w'} for this part, (v) regress Δ̂_{i,w,w'} on the covariates of the third part to estimate IATEs, and (vi) use the obtained model to predict IATEs in the fourth part. Each of the first three parts can play each role of steps (ii) to (v) once and the resulting three IATE models are then averaged to provide the IATE predictions of the fourth part. Finally, we can iterate such that we receive out-of-sample predictions for each fold (see Algorithm 1 in the Appendix for details). The computational downside of this procedure is that we cannot reuse the same nuisance parameter predictions as for the average estimators. (The Orthogonal Random Forest of Oprescu et al. (2019) is another estimator that is based on the pseudo-outcome idea and can be asymptotically normal under the assumption of parametric nuisance parameters. We focus in this paper on the more general DR-learner.)
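The role rotation in this cross-fitting scheme can be sketched as pure index bookkeeping. The following illustrative Python snippet (a hypothetical helper, not part of the paper or of causalDML) enumerates, for each held-out part, how the remaining three parts rotate through the roles of propensity estimation, outcome estimation and pseudo-outcome regression:

```python
def drlearner_roles(parts=(0, 1, 2, 3)):
    """Role schedule for the out-of-sample DR-learner cross-fitting:
    each part is predicted once; the other three parts rotate through
    the roles (propensity model, outcome model, pseudo-outcome
    regression), so each plays each role exactly once. Returns tuples
    (e_part, mu_part, reg_part, pred_part)."""
    schedule = []
    for pred in parts:
        rest = [p for p in parts if p != pred]
        for r in range(3):
            e_part = rest[r % 3]
            mu_part = rest[(r + 1) % 3]
            reg_part = rest[(r + 2) % 3]
            schedule.append((e_part, mu_part, reg_part, pred))
    return schedule
```

The schedule has 4 × 3 = 12 entries; the three IATE models sharing the same `pred_part` are the ones averaged for that part's predictions.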
(Statistical inference at this individualized level is challenging for continuous Z_i and impossible for high-dimensional Z_i, as discussed for example by Chernozhukov, Demirer, Duflo, and Fernandez-Val (2017).)

The DR-learner shares the problem of all estimators that involve reweighting by the inverse of the propensity score: the inverse probability weights do not sum to one in finite samples. We therefore propose to adapt the idea of Hájek (1971) and normalize the inverse probability weights to sum to one. This normalization is recommended to stabilize estimators of average effects (e.g. Imbens, 2004; Lunceford & Davidian, 2004; Robins, Sued, Lei-Gomez, & Rotnitzky, 2007; Busso, DiNardo, & McCrary, 2014). However, it could play an even bigger role in the estimation of conditional effects because finite sample imbalances are more likely to occur at the individualized level. Thus, we propose the normalized DR-learner (NDR-learner) as a stabilized complement to the DR-learner.

The NDR-learner is less flexible than the DR-learner in the sense that it requires applying linear smoothers (e.g. Buja, Hastie, & Tibshirani, 1989, and references therein) to estimate the IATEs. However, this restriction still allows the use of popular machine learning methods like tree-based methods (regression trees, Random Forests or boosted trees), Ridge or any method that runs OLS after variable selection like Post-Lasso (Belloni & Chernozhukov, 2013). Note further that the nuisance parameters can still be estimated with supervised machine learning methods that are not linear smoothers.

Linear smoothers can be represented as linear combinations of (pseudo-)outcomes. This means we know the weight α_i(x) that each individual (pseudo-)outcome receives in predicting the (pseudo-)outcome at x.
When such weights are available, the DR-learner estimated IATE can be expressed as

τ̂^{drl}_{w,w'}(x) = Σ_{i=1}^N α_i(x) Δ̂_{i,w,w'}
                  = Σ_{i=1}^N α_i(x) [μ̂(w, X_i) − μ̂(w', X_i)]
                    + Σ_{i=1}^N α_i(x) D_i(w) / ê_w(X_i) · Ỹ_i(w, X_i)
                    − Σ_{i=1}^N α_i(x) D_i(w') / ê_{w'}(X_i) · Ỹ_i(w', X_i),   (7)

where Ỹ_i(w, X_i) = Y_i − μ̂(w, X_i) denotes the individual specific outcome residual of treatment arm w, and λ^w_i(x) ≡ α_i(x) D_i(w) / ê_w(X_i) and λ^{w'}_i(x) ≡ α_i(x) D_i(w') / ê_{w'}(X_i) collect the weights on the residuals. In finite samples, the λ^w_i(x) and λ^{w'}_i(x) usually do not sum to one. This is especially problematic if they sum to something much greater than one. In this case the weighted residuals receive much more weight than the outcome regressions. This might result in implausibly large effect estimates that could even fall outside of the possible bounds of a given outcome variable (Kang & Schafer, 2007; Robins et al., 2007). (For bounded outcomes, the effects must lie in the interval [Y_min − Y_max, Y_max − Y_min], with Y_min and Y_max denoting the minimum and maximum values of the outcome, respectively.)

The NDR-learner normalizes the weights to sum to one:

τ̂^{ndrl}_{w,w'}(x) = Σ_{i=1}^N α_i(x) [μ̂(w, X_i) − μ̂(w', X_i)]
                    + ( Σ_{i=1}^N λ^w_i(x) )^{−1} Σ_{i=1}^N λ^w_i(x) Ỹ_i(w, X_i)
                    − ( Σ_{i=1}^N λ^{w'}_i(x) )^{−1} Σ_{i=1}^N λ^{w'}_i(x) Ỹ_i(w', X_i).   (8)

This is more demanding from a computational point of view because it requires calculating the weights α_i(x) and the normalization for each x of interest (Algorithm 2 provides the details of the implementation). However, the application below shows that the normalization deals well with the cases where outcome residuals receive high weights, leading to implausibly large effect estimates. Thus, the NDR-learner is an interesting alternative to the DR-learner if effect sizes become suspicious.
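The difference between Equations 7 and 8 is only the normalization of the residual weights. The following Python sketch (a hypothetical helper for a single evaluation point x, not the paper's causalDML implementation) computes both versions given smoother weights alpha_i(x) and cross-fitted nuisance predictions:

```python
def iate_at_x(alpha, d_w, d_w0, e_w, e_w0, y, mu_w, mu_w0):
    """DR-learner and NDR-learner IATE at one point x (Equations 7, 8).
    Arguments are lists over the N observations: alpha_i(x) are the
    linear smoother weights; d_w/d_w0 the indicators D_i(w), D_i(w');
    e_w/e_w0 and mu_w/mu_w0 the cross-fitted nuisance predictions."""
    reg = sum(a * (m1 - m0) for a, m1, m0 in zip(alpha, mu_w, mu_w0))
    lam_w = [a * d / e for a, d, e in zip(alpha, d_w, e_w)]      # residual weights
    lam_w0 = [a * d / e for a, d, e in zip(alpha, d_w0, e_w0)]
    res_w = [yi - m for yi, m in zip(y, mu_w)]                   # outcome residuals
    res_w0 = [yi - m for yi, m in zip(y, mu_w0)]
    drl = reg + sum(l * r for l, r in zip(lam_w, res_w)) \
              - sum(l * r for l, r in zip(lam_w0, res_w0))
    # NDR-learner: normalize each residual weight vector to sum to one
    ndrl = reg + sum(l * r for l, r in zip(lam_w, res_w)) / sum(lam_w) \
               - sum(l * r for l, r in zip(lam_w0, res_w0)) / sum(lam_w0)
    return drl, ndrl
```

With weights that already sum to one, both estimates coincide; when a small propensity score inflates a residual weight, only the DR-learner estimate is blown up, which is exactly the instability the normalization removes.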
3.4 Optimal treatment assignment

The APO score of Section 3.1 can also be reused to estimate optimal treatment assignment rules. To this end, note that the value function of any policy rule π(Z_i) can be estimated as

Q̂(π) = N^{−1} Σ_{i=1}^N Σ_{w=0}^T 1(π(Z_i) = w) Γ̂_{i,w}.

This means each individual contributes the score of the treatment that she is assigned to under this policy rule. However, we are not necessarily interested in the value function of some policy rule, but want to estimate the optimal policy rule that maximizes this value function, π̂* = argmax_{π ∈ Π} Q̂(π). This requires searching over all candidate policy rules to find the optimum as there exists no closed form solution.

Example:
Consider the case where Z_i is a binary covariate and W_i is a binary treatment. We have four different policy rules: treat nobody (π_0), treat only those with Z_i = 1 (π_1), treat only those with Z_i = 0 (π_2), or treat everybody (π_3). We illustrate this using two representative observations, i = 1 with Z_1 = 0 and i = 2 with Z_2 = 1, in Table 1. Columns three to six show the assignments under the four potential assignment rules. For example, the first observation receives no treatment under policy rules π_0 and π_1, but is treated under policy rules π_2 and π_3. To find the optimal rule, we compare the means of the APO scores in the last four columns and pick the policy rule that corresponds to the largest mean. The number of policy values to compare increases dramatically in settings with multiple treatments and Z_i being a vector of potentially non-binary variables.

Table 1: Example of DML based optimal treatment assignment

  i   Z_i   π_0   π_1   π_2   π_3   Q̂(π_0)    Q̂(π_1)    Q̂(π_2)    Q̂(π_3)
  1    0     0     0     1     1    Γ̂_{1,0}   Γ̂_{1,0}   Γ̂_{1,1}   Γ̂_{1,1}
  2    1     0     1     0     1    Γ̂_{2,0}   Γ̂_{2,1}   Γ̂_{2,0}   Γ̂_{2,1}
 ...  ...   ...   ...   ...   ...     ...        ...        ...        ...

We expect that the estimated policy in finite samples and with estimated nuisance parameters does not coincide with the true optimal policy rule. This is conceptualized as the 'regret', defined as the difference between the true and the estimated optimal value function, R(π̂*) = Q(π*) − Q(π̂*).

Zhou et al. (2018) show that the DML based procedure minimizes the maximum regret asymptotically under two main conditions: First, the same convergence conditions for the nuisance parameters that are required for ATE estimation (the product of the nuisance parameter convergence rates achieves n^{1/2}). Second, the set of candidate policy rules Π is not too complex. In particular, Zhou et al. (2018) show that decision trees with fixed depth are a suitable class of policy rules.
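For a discrete covariate and a binary treatment, the search in the example can be written as a brute-force enumeration of all deterministic rules. The following illustrative Python sketch (over an unrestricted rule class; the paper's setting restricts Π, e.g. to shallow decision trees as in Zhou et al., 2018) picks the rule with the highest estimated value:

```python
from itertools import product
from statistics import mean

def best_policy(z, gamma0, gamma1):
    """Exhaustive search over all deterministic rules pi mapping the
    levels of a discrete covariate z to {0, 1}, picking the rule with
    the highest Q_hat(pi) = mean of the assigned APO scores."""
    levels = sorted(set(z))
    best = None
    for assignment in product([0, 1], repeat=len(levels)):
        rule = dict(zip(levels, assignment))          # candidate policy pi
        q_hat = mean(g1 if rule[zi] == 1 else g0      # assigned score
                     for zi, g0, g1 in zip(z, gamma0, gamma1))
        if best is None or q_hat > best[1]:
            best = (rule, q_hat)
    return best
```

With one binary z this enumerates exactly the four rules π_0 to π_3 of Table 1; with many covariate levels the candidate set explodes, which is why restricted classes such as fixed-depth trees are used in practice.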
Again, the double robustness of the used scores results in statistical guarantees that are not achievable for methods based on outcome regressions or IPW alone.

4 Application

We use a standard dataset of the Swiss Active Labor Market Policy (ALMP) that is already the basis of previous studies (Huber et al., 2017; Lechner, 2018; Knaus et al., 2020a) to estimate the effect of different programs on employment. In particular, we start with the sample of 100,120 unemployed individuals of Huber et al. (2017), which consists of 24 to 55 year old individuals registered as unemployed in 2003. (Gerfin and Lechner (2002), Lalive, van Ours, and Zweimüller (2008) and Knaus et al. (2020a), among others, provide a more detailed description of the surrounding institutional setting. The dataset is available as restricted use file via FORSbase (ref study: 13867).) We consider non-participants and participants of four different program types: job search, vocational training, computer programs and language courses. (The dataset also contains participants of an employment program and a personality training. However, we leave them out to keep the number of obtained results manageable.) As the assignment policies differ substantially across the three language regions, we focus only on individuals living in the German speaking part and remove those in the French and Italian speaking parts to avoid common support problems.

This leaves us with 67,577 observations. We evaluate the first program participation within the first six months after the beginning of the unemployment spell. One problem of this definition is that non-participants comprise people that quickly come back into employment before they would be assigned to a training program. This could result in an overly optimistic evaluation of non-participation. We follow Lechner (1999) and Lechner
No program Job search Vocational Computer Language(1) (2) (3) (4) (5)No. of observations 47,653 11,610 858 905 1504Outcome: months employed of 31 14.7 14.4 18.4 19.2 13.5Female (binary) 0.44 0.44 0.33 0.60 0.55Age 36.61 37.31 37.45 39.08 35.28Foreigner (binary) 0.37 0.33 0.30 0.21 0.67Employability 1.93 1.98 1.93 1.97 1.85Past income in CHF 10,000 4.25 4.67 4.87 4.32 3.73
Note:
Employability is an ordered variable with one indicating low employability, two mediumemployability and three high employability. The exchange rate USD/CHF was roughly 1.3 at thattime. The full set of variables is reported in Table C.1. and Smith (2007) and assign pseudo program starting points to the non-participants andkeep only those who are still unemployed at this point. This results in a final samplesize of 62,530 observations.The outcome of interest is the cumulated number of months in employment in the 31months after program start, which is the maximum available time span in the dataset.Row one of Table 2 provides the number of observations in each group. Roughly 75%participate in no program. By far the largest program is the job search program, whichis also called basic program. The more specific programs are much smaller with roughly1000 observations each. Row two shows that the average outcomes substantially differby different groups. However, it is not clear whether this is only due to selection effectsbecause the observable characteristics are not comparable across groups, as the remainingrows show. Especially the share of females, the share of foreigners and past incomediffer quite substantially across programs. The control variables comprise 45 variablesthat are reported in Table C.1. They consist of socio-economic characteristics of theunemployed individuals, caseworker characteristics, information about the assignmentprocess, information about the previous job and regional economic indicators. The assignment of the pseudo starting point is based on estimated probabilities to start a program at aspecific time. The probability depends also on covariates and is estimated using the same random forestspecification that is discussed later in Section 5. Implementation
We estimate the nuisance parameters via Random Forest (Breiman, 2001) using the implementation with honest splitting in the grf
R-package (Athey et al., 2019) and 5-fold cross-fitting. The tuning parameters in each regression are selected by out-of-bag validation. All regressions apply the full set of control variables listed in Table C.1. We run the outcome regressions for each treatment group separately to obtain μ̂(w, x). The propensity scores are also estimated separately for each treatment, using a treatment indicator as outcome in the random forest. The propensity scores are then normalized to sum to one within an individual.

We estimate CATEs at different levels of granularity. First, we investigate GATEs for subgroups by gender, foreigner status and three categories of employability. These are regularly used in the program evaluation literature and usually investigated by re-estimating everything in the subgroups. However, this can be performed at very low computational cost after DML for average effects, using only a standard OLS regression with the pseudo-outcome as described in Section 3.3.1 and dummy variables for all groups but the reference group as covariates. Second, we estimate kernel regression CATEs for the continuous variables age and past income based on the R-package np (Hayfield & Racine, 2008). The kernel regressions apply a second-order Gaussian kernel function and use 0.9 of the cross-validated bandwidth for undersmoothing, as suggested by Zimmert and Lechner (2019). Third, we specify an OLS model in which all five previously used variables enter linearly. Finally, we go beyond the handpicked variables and estimate the IATEs using all 45 control variables in the DR-learner and the NDR-learner. Both are implemented with the honest Random Forest because the grf package allows extraction of the prediction weights α_i(x) required for the NDR-learner. We apply both variants described in Section 3.3.3: in one, we estimate the IATE for each observation using the DR- and NDR-learner in the full sample; in the other, we predict them out-of-sample.
For the latter, Appendix B provides a detailed description of the underlying DR- and NDR-learner algorithms.

Table 3: Steps of implementation

Step  Input                               Operation                             Output
1.    W_i, X_i                            Predict treatment probabilities       ê_w(x)
2.    Y_i, W_i, X_i                       Predict treatment specific outcomes   μ̂(w, x)
3.    Y_i, W_i, ê_w(x), μ̂(w, x)          Plug into Equation 4                  Γ̂_{i,w}
4.    Γ̂_{i,w}                            Mean, one-sample t-test               APOs
5.    Γ̂_{i,w}                            Take difference                       Δ̂_{i,w,w′}
6.    Δ̂_{i,w,w′}                         Mean, one-sample t-test               ATEs
7.    Δ̂_{i,w,w′}, Z_i                    Ordinary least squares                GATEs or BLP CATEs
8.    Δ̂_{i,w,w′}, Z_i                    Kernel regression                     KR CATEs
9.    Δ̂_{i,w,w′}, X_i                    Supervised machine learning           IATEs
10.   Γ̂_{i,w}, Z_i                       Optimal decision tree                 Optimal treatment rule

The optimal treatment assignment rule is estimated as decision trees of depth one, two and three. We follow Algorithm 2 for exact tree-search of Zhou et al. (2018) that is implemented in the policytree
R-package (Sverdrup, Kanodia, Zhou, Athey, & Wager, 2020). We estimate the trees first with the five handpicked variables. However, these variables include gender and foreigner status, which might be too sensitive to include in practice. Thus, we investigate another set of 16 variables that includes only the objective measures of education and labor market history of the unemployed persons that would be available to the caseworker from the administrative records.

Table 3 summarizes all required implementation steps. It highlights that a comprehensive DML based program evaluation can be run with a few lines of code in any statistical software program that is capable of the operations in the third column. Thus, researchers can build their customized analyses in a modular fashion based on established code. Alternatively, the R-package causalDML already implements the required steps, as showcased in the replication notebook accompanying this paper.
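The exact tree search of step 10 can be illustrated for depth one by brute force: for every split variable and every threshold, give each leaf the treatment with the largest summed score. The sketch below only mimics the logic of the policytree implementation referenced above, in Python with simulated data rather than the paper's R code:

```python
# Brute-force depth-1 policy tree over doubly robust scores: for each split
# variable and threshold, each leaf gets the treatment with the largest
# summed score. Data and the two-treatment setup are simulated placeholders.
import numpy as np

def depth1_tree(Z, Gamma):
    """Return (variable, threshold, leaf treatments) maximizing sum_i Gamma[i, pi(Z_i)]."""
    best_value, best_rule = -np.inf, None
    for j in range(Z.shape[1]):
        for thr in np.unique(Z[:, j])[:-1]:   # keep both leaves non-empty
            left = Z[:, j] <= thr
            w_left = Gamma[left].sum(axis=0).argmax()
            w_right = Gamma[~left].sum(axis=0).argmax()
            value = Gamma[left, w_left].sum() + Gamma[~left, w_right].sum()
            if value > best_value:
                best_value, best_rule = value, (j, thr, w_left, w_right)
    return best_rule

rng = np.random.default_rng(2)
n = 800
Z = rng.normal(size=(n, 2))                  # two candidate policy variables
Gamma = rng.normal(size=(n, 2))              # stand-in scores for w = 0, 1
Gamma[:, 1] += np.where(Z[:, 0] > 0.5, 2.0, -2.0)   # treat only if Z_0 > 0.5
j, thr, w_left, w_right = depth1_tree(Z, Gamma)
print(j, round(thr, 2), w_left, w_right)     # splits on variable 0 near 0.5
```

Enumerating every threshold is what makes the search exact; the combinatorics of deeper trees are what the specialized algorithm of Zhou et al. (2018) handles efficiently.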
We focus here on the effect estimates and discuss the nuisance parameters in Appendix C.2. Throughout this section, we compare the four programs to non-participation. Recall that the outcome of interest is the cumulated number of months employed in the 31 months after program start. Figure 2 depicts ATE and ATET estimates and shows

The underlying APOs are shown in Figure C.2 of Appendix C.

[Figure 2: Average treatment effects of job search, vocational, computer and language programs vs. non-participation; estimands: ATE and ATET]
Note:
The figure shows the point estimates of the average treatment effects of participating in the program labeled on the x-axis vs. non-participation and their 95% confidence intervals. Numeric results are in Panels B and C of Table C.5.

substantial differences in the effectiveness of the programs. The job search program decreases the months in employment on average by about one month. In contrast, other programs that teach hard skills show substantial improvements, with roughly three additional months in employment on average.

Comparing ATE and ATET shows no major differences for most programs. This suggests that there is either no effect heterogeneity correlated with observables or that the assignment does not take advantage of this heterogeneity. We would expect ATETs to be higher than ATEs if program assignment were well targeted. However, we only find evidence for the opposite: the actual participants of a language course show a 1.5 months lower treatment effect than the population. This difference suggests that there is substantial effect heterogeneity to uncover and potential to improve treatment assignment.
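Steps 4 to 6 of Table 3 reduce to means and one-sample t-tests of the score differences. A minimal sketch with simulated scores (the numbers below are placeholders loosely calibrated to the outcome scale, not the paper's data, and Python stands in for the R implementation):

```python
# ATE estimation after DML in miniature: the ATE is the mean of the
# per-observation score differences, with a one-sample t-test for inference.
# All scores are simulated placeholders.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
gamma_0 = rng.normal(loc=14.7, scale=8.0, size=n)            # no-program scores
gamma_w = gamma_0 + rng.normal(loc=-1.0, scale=4.0, size=n)  # job-search-like scores

delta = gamma_w - gamma_0                # step 5: per-observation score difference
ate = delta.mean()                       # step 6: mean ...
se = delta.std(ddof=1) / np.sqrt(n)      # ... and its standard error
t_stat = ate / se
print(round(ate, 2), round(t_stat, 1))   # an ATE of about -1 month, clearly significant
```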
This subsection studies effect heterogeneity at different levels of granularity. We start by estimating group average treatment effects (GATEs). Panel A of Table 4 shows the result of an OLS

For a better understanding of the underlying dynamics, Figure C.3 in Appendix C reports and discusses the effects of program participation on the employment probabilities over time.
Table 4: Group average treatment effects

                        Job search  Vocational  Computer  Language
                           (1)         (2)        (3)       (4)

Panel A:
Constant                 -1.27***      ***         ***       ***
                         (0.17)      (0.55)      (0.60)    (0.46)
Female                    0.57**     -1.10        2.53***  -1.67**
                         (0.25)      (0.87)      (0.86)    (0.76)

Panel B:
Constant                 -1.28***      ***         ***       ***
                         (0.16)      (0.53)      (0.50)    (0.51)
Foreigner                 0.73***      **        -0.80     -2.91***
                         (0.26)      (0.89)      (0.94)    (0.71)

Panel C:
Constant                 -0.15        5.36***      ***       ***
                         (0.33)      (1.03)      (1.09)    (0.88)
Medium employability     -0.94***    -2.28**     -2.63**   -0.17
                         (0.36)      (1.15)      (1.20)    (0.98)
High employability       -1.70***    -4.62***    -3.29*
F-statistic                ***         **          *

Note:
This table shows OLS coefficients and their heteroscedasticity robust standard errors (in parentheses) of regressions run with the pseudo-outcome defined as described in Section 3.3. * p < 0.1, ** p < 0.05, *** p < 0.01

regression with a female dummy as covariate, Δ̂_{i,w,w′} = β₀ + β₁ female_i + error_i. The constant (β₀) provides the GATE for the reference group, men, and the female coefficient (β₁) describes how much the GATE differs for women. The results show substantial gender differences in the effectiveness of programs. Women suffer significantly less from job search participation and profit significantly more from computer program participation. This gender gap in the effectiveness of ALMPs is also well-documented in the literature (Crépon & van den Berg, 2016; Card, Kluve, & Weber, 2018). In contrast to this, we find that women profit on average significantly less from language courses than men.

Panel B replaces the female dummy in the regression by a foreigner dummy. Strikingly, Swiss citizens as the reference group show a big positive effect of participating in language courses, but the effect disappears for foreigners. After adding the coefficient for foreigners to the constant, the foreigners' GATE is only 0.71 (3.62 − 2.91, standard error: 0.62). Crucial information for better understanding this finding would be which languages they learn, which is unfortunately not available in this dataset.

Panel C shows the results of a similar regression but now with two dummies indicating

[Figure 3: Effect heterogeneity regarding past income. Kernel regression CATEs along past income (0 to 75,000 CHF) for (a) Job search, (b) Vocational, (c) Computer, (d) Language]

Note:
Dotted line indicates the point estimate of the respective average treatment effect. Grey area shows the 95%-confidence interval.

medium and high employability, such that low employability becomes the reference group. The F-statistic in the last line tests the joint significance of the two dummies. It is statistically significant at least at the 10%-level for the programs in the first three columns. They all show a common gradient: individuals with low employability benefit substantially more, or at least suffer less, from program participation.
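The GATE regressions above (step 7 of Table 3) are plain OLS of the pseudo-outcome on group dummies: the constant estimates the reference-group GATE and each coefficient the difference to it. A minimal Python sketch with a simulated pseudo-outcome (the GATE values below are invented, not the paper's estimates):

```python
# GATE estimation after DML: regress the pseudo-outcome on a female dummy.
# The pseudo-outcome 'delta' is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
female = rng.integers(0, 2, n)
# simulated truth: GATE of -1.3 for men and -1.3 + 0.6 = -0.7 for women
delta = -1.3 + 0.6 * female + rng.normal(scale=5.0, size=n)

X = np.column_stack([np.ones(n), female])       # constant + female dummy
beta, *_ = np.linalg.lstsq(X, delta, rcond=None)
gate_men, gate_gap = beta                        # beta0 = men's GATE, beta1 = gap
print(round(gate_men, 2), round(gate_gap, 2))    # close to -1.3 and 0.6
```

In the paper the standard errors are heteroscedasticity robust; any standard OLS routine with robust standard errors reproduces the inference step.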
While subgroup analyses are standard in program evaluations, the estimation of kernel regression CATEs along continuous variables is rarely pursued. We estimate such CATEs along the continuous variables past income and age and find no notable heterogeneity for the latter. However, effect sizes are clearly associated with past income. Figure 3 shows

Figure C.4 in the appendix shows the corresponding results.
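The kernel regression step can be sketched with a hand-rolled Nadaraya-Watson smoother of the pseudo-outcome; the paper uses the np package in R, so the data, the effect gradient and the bandwidth below are illustrative assumptions only:

```python
# Nadaraya-Watson sketch of a kernel regression CATE along a continuous
# variable. Data, bandwidth and the declining effect are simulated assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 3000
income = rng.uniform(0, 8, n)             # past income in CHF 10,000
delta = 1.0 - 0.4 * income + rng.normal(scale=5.0, size=n)  # effect falls with income

def kr_cate(x0, x, y, bw):
    """Kernel-weighted mean of the pseudo-outcome at evaluation point x0."""
    w = np.exp(-0.5 * ((x - x0) / bw) ** 2)    # second-order Gaussian kernel
    return (w * y).sum() / w.sum()

bw = 0.9 * 0.8                                 # 0.9 x an assumed CV bandwidth
cate_low = kr_cate(1.0, income, delta, bw)     # CATE at low past income
cate_high = kr_cate(7.0, income, delta, bw)    # CATE at high past income
print(round(cate_low, 2), round(cate_high, 2))
```

Undersmoothing with 0.9 times the cross-validated bandwidth, as in the paper, keeps the smoothing bias small enough for valid confidence intervals.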
Table 5: Best linear prediction of effect heterogeneity

                           Job search  Vocational  Computer  Language
                              (1)         (2)        (3)       (4)
Constant                    -0.49        3.89       5.16**      **
                            (0.71)      (2.38)     (2.37)    (2.11)
Female                       0.21       -1.97**      **      -1.31*
                            (0.27)      (0.92)     (0.91)    (0.79)
Age                          0.03*        **          *        ***
                            (0.27)      (0.90)     (0.96)    (0.74)
Medium employability        -0.65*      -1.48      -2.31*    -0.75
                            (0.37)      (1.17)     (1.22)    (1.01)
High employability          -1.21**     -3.29**    -2.82     -0.37
                            (0.51)      (1.52)     (1.72)    (1.51)
Past income in CHF 10,000   -0.26***    -0.64***   -0.42**     *
                            (0.06)      (0.23)     (0.19)    (0.18)
F-statistic                  6.72***      ***         ***      ***

Note:
This table shows OLS coefficients and their heteroscedasticity robust standard errors (in parentheses) of regressions run with the pseudo-outcome as described in Section 3.3.1. * p < 0.1, ** p < 0.05, *** p < 0.01

that effects decrease with higher past income for all but the language programs. The latter have only a small positive effect for individuals with low past income, but it increases with higher income. One potential explanation for these findings is that the value of language skills is larger for high-skilled workers in multilingual countries like Switzerland because they reduce information costs across language borders (see, e.g., Isphording, 2014).

The CATEs considered so far were nonparametric but only univariate. Now we model the CATE by specifying a multivariate OLS regression with the previously used covariates entering linearly. It is most likely misspecified and thus estimates the best linear predictor (BLP) of the CATEs with respect to these variables. However, it provides a compact and accessible summary of the effect heterogeneities. Additionally, it holds the other included variables constant. Consider, for example, the coefficients for being female in Table 5. Compared to Table 4, the coefficients in the first three columns are smaller and the one for language courses is larger (for example, for job search it is 0.2 instead of 0.6). The reason is that it represents a partial effect that holds other variables like past income fixed. The

[Figure 4: Boxplot of out-of-sample predicted IATEs by DR-learner (DRL) and NDR-learner (NDRL) for job search, vocational, computer and language programs]
Note:
The figure shows the distribution of IATEs for participating in the program labeled on the x-axis vs. non-participation, estimated by the DR-learner (DRL) and the NDR-learner (NDRL). The dashed line indicates the possible range of the IATE of [-31, 31] to illustrate that several DR-learner estimated IATEs lie outside this bound.

subgroup female coefficient in Table 4 partly picks up that women have lower past income and that lower income is associated with higher treatment effects for all but language courses. This example illustrates that the same strategies that are usually applied to interpret an outcome OLS model can now be used to interpret the effect OLS model.
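A small numerical illustration of why the DR-learner can produce estimates outside the feasible range noted above: the pseudo-outcome divides the outcome residual by the propensity score, so a very small ê yields extreme values. The score form below is the standard AIPW expression (assumed here to match the paper's Equation 4) and all numbers are invented:

```python
# Why DR-learner IATEs can leave the feasible range [-31, 31]: the AIPW-type
# pseudo-outcome inflates the outcome residual by 1 / e_hat, so tiny
# propensity scores produce extreme values. All numbers are invented.
y, w = 31.0, 1          # a treated individual with the maximal outcome
mu1, mu0 = 15.0, 14.0   # predicted outcomes under treatment and non-treatment
e1 = 0.01               # a very small estimated propensity score

pseudo = (mu1 - mu0
          + (w == 1) * (y - mu1) / e1
          - (w == 0) * (y - mu0) / (1.0 - e1))
print(round(pseudo))    # about 1601, far outside the feasible range
```

Averaging such scores is still fine for the ATE, but a flexible learner fitted to them can chase these outliers, which is the instability the NDR-learner is designed to dampen.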
We focus on the results based on the out-of-sample variant of the DR- and NDR-learner, as the full sample variant leads to severe overfitting, with predicted IATEs ranging from -209 to 165, up to seven times larger than is possible given that the outcome is bounded between zero and 31. However, Figure 4 shows that the DR-learner produces impossible effect sizes even out-of-sample, which motivates the proposal of the NDR-learner as a stabilized variant. Figure 4 provides boxplots of the predicted IATEs and shows several substantial outliers lying below the smallest possible value of -31. However, the descriptive statistics provided in Table C.6 and the joint and marginal distributions depicted in Figure C.6 document that, besides the outliers, the distributions are quite similar and correlate

See Appendix C.5 for results and discussion of the full sample.
Table 6: Classification analysis

                                                   Job search  Vocational  Computer  Language
                                                      (1)         (2)        (3)       (4)
Previous job: unskilled worker                        0.97        0.73       0.39     -1.38
Past income                                          -1.33       -0.92      -1.13      1.03
Mother tongue other than German, French, Italian      0.65        0.71       0.05     -1.28
Qualification: some degree                           -0.85       -0.68      -0.44      1.28
Swiss citizen                                        -0.61       -0.68       0.05      1.27
Qualification: unskilled                              0.77        0.47       0.32     -1.16
Fraction of months employed last 2 years             -1.02       -0.42      -0.44      0.35
Previous job: skilled worker                         -0.76       -0.47      -0.17      1.02
Note:
Table shows the differences in means of standardized covariates between the fifth and the first quintile of the respective estimated IATE distribution.

with at least 0.87. Not surprisingly, the impact of the normalized weights is much larger for the three smaller programs and nearly negligible for the job search program. Still, we base the following discussion for all programs on the more stable results of the NDR-learner.

We conduct a classification analysis as proposed by Chernozhukov, Fernandez-Val, and Luo (2018) to understand which variables are most predictive of effect sizes. To this end, we split the predicted IATE distributions into quintiles and compare the covariate means of the observations falling into the fifth and first quintiles. For comparability, we normalize all covariates to have mean zero and variance one. Table 6 shows the eight variables that have at least one absolute difference between the highest and lowest quintile that is larger than one standard deviation. For example, we observe that the group with the highest effects of a job search program (the fifth quintile) has a 1.33 standard deviations lower past income than the lowest IATE group (the first quintile). The other variables also confirm the patterns documented in the previous subsections. The effects of job search, vocational and computer training are higher for unskilled workers with lower previous labor market success and for foreigners, while the opposite holds for language programs.

The previous section documented substantial heterogeneities in the program effects. To leverage this heterogeneity for better targeting, we apply the DML based optimal policy

Table C.7 shows the classification analysis for all variables.

[Figure 5: Estimated optimal decision trees; panels (a) depth 1 & 5 covariates, (b) depth 2 & 5 covariates, (c) depth 1 & 16 covariates, (d) depth 2 & 16 covariates]
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

algorithm of Section 3.4. Figure 5a shows the simplest decision tree with only one split for the five handpicked covariates. It would allocate men to vocational training and women to computer courses. This split is probably similar to what we would have suggested given the evidence presented in Table 4. For a tree of depth two, such an eyeballing approach reaches its limits, and the algorithmic approach provides a systematic way to arrive at an estimated optimal decision tree. The tree in Figure 5b splits first on being a foreigner and then along past income. In the absence of the possibility to split on gender, the depth-one tree in Figure 5c splits on past income at roughly the same value where the KR CATEs of computer and language training intersect in Figure 3.

Panel A of Table 7 summarizes the results of the different trees. It shows the percentage of individuals that are placed in the different programs. Not surprisingly, all individuals are recommended to be placed into one of the three positively evaluated hard-skill enhancing programs.

One yet unsolved challenge is how to draw statistical inference about the quality and stability of the decision trees. Athey and Wager (2017) propose a form of cross-validation. To this end, we use the same folds that were used in the cross-fitting procedure to estimate

Appendix C.6 also provides the trees of depth three.
Table 7: Results of the estimated policy trees

                          No program  Job search  Vocational  Computer  Language
                             (1)         (2)         (3)        (4)       (5)

Panel A: Percent allocated to program
Depth 1 & 5 variables         0           0           56         44        0
Depth 2 & 5 variables         0           0           33         47       19
Depth 3 & 5 variables         0           0           34         45       21
Depth 1 & 16 variables        0           0            0         54       46
Depth 2 & 16 variables        0           0           19         40       40
Depth 3 & 16 variables        0           0           26         38       35

Panel B: Cross-validated difference to APOs
Depth 1 & 5 variables       3.27***      ***                               *
                           (0.41)      (0.42)      (0.50)     (0.49)    (0.42)
Depth 2 & 5 variables       3.79***      ***                              ***
                           (0.41)      (0.42)      (0.42)     (0.52)    (0.46)
Depth 3 & 5 variables       3.92***      ***                              ***
                           (0.42)      (0.43)      (0.47)     (0.48)    (0.48)
Depth 1 & 16 variables      3.44***      ***                               **
                           (0.41)      (0.43)      (0.51)     (0.48)    (0.42)
Depth 2 & 16 variables      3.63***      ***                               **
                           (0.42)      (0.43)      (0.49)     (0.50)    (0.44)
Depth 3 & 16 variables      3.51***      ***                               **
                           (0.43)      (0.45)      (0.47)     (0.49)    (0.47)

Note:
Panel A shows the percentage of individuals being assigned to a specific program. Panel B shows a t-test of the difference between the cross-validated policy (standard errors in parentheses) and the APOs of the programs. * p < 0.1, ** p < 0.05, *** p < 0.01

the nuisance parameters. We build the decision tree in four folds and evaluate the value in the left-out fold. First, we inspect how often the recommendations based on these trees coincide with the full sample policy rules. Figures C.8 and C.10 show that the cross-validated trees are not identical to the full sample ones.

Zhou et al. (2018) propose another validation idea and test whether the optimal policy rules perform significantly better than sending all individuals to the same program. This is achieved by taking the difference of the APO score of the cross-validated policy rule and the APO score of program w: Δ̂^cv_{i,w}(π) = Σ_{t=0}^{T} 1{π̂^cv(Z_i) = t} Γ̂_{i,t} − Γ̂_{i,w}, where π̂^cv(Z_i) is the policy rule that is estimated without individual i. A standard t-test on the mean of Δ̂^cv_{i,w}(π) then tests whether the cross-validated policy rules are significantly better than sending everybody to the same program. Note that the cross-validated policy rules do not necessarily coincide with the trees in the full sample, and the cross-validation does not estimate the value function for that specific tree. This would require holding out a test set, which would be viable for an application with bigger programs.

The results are provided in Panel B of Table 7. We can interpret the mean of Δ̂^cv_{i,w}(π) as the average treatment effect of comparing a regime under the estimated assignment rule to a regime where everybody is sent to the same program. This effect is positive for all but one tree specification, indicating that the estimated rules can leverage the effect heterogeneities to improve the allocation.
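The mechanics of this comparison can be sketched in a few lines; the scores and the stand-in 'cross-validated' rule below are simulated, so the sketch only illustrates the t-test on Δ̂^cv, not the paper's estimates:

```python
# Cross-validated policy comparison in miniature: score of the rule minus the
# score of sending everybody to program w, tested with a one-sample t-test.
# Scores and the stand-in rule are simulated placeholders.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
Z = rng.integers(0, 2, n)                        # policy variable
Gamma = np.column_stack([                        # APO scores for programs 0, 1
    rng.normal(15.0, 6.0, n) + 2.0 * Z,          # program 0 works when Z = 1
    rng.normal(15.0, 6.0, n) + 2.0 * (1 - Z)])   # program 1 works when Z = 0
pi_cv = 1 - Z                                    # stand-in cross-validated rule

# Delta_cv: score under the rule minus score of always assigning program 1
delta_cv = Gamma[np.arange(n), pi_cv] - Gamma[:, 1]
t_stat = delta_cv.mean() / (delta_cv.std(ddof=1) / np.sqrt(n))
print(round(delta_cv.mean(), 2), round(t_stat, 1))   # positive and significant
```

Because the rule here exploits the simulated heterogeneity, the mean difference is positive and the t-test rejects; in the application, the analogous test compares the cross-fitted rule against each single-program benchmark.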
However, the cross-validated policy rules do not perform significantly better than simply sending everybody into vocational or computer programs. This would probably change if we could take costs or capacity constraints into account. However, we do not observe costs in this dataset, and the optimal decision tree algorithm is currently not capable of incorporating capacity constraints in a systematic way. We leave both extensions for future research using a more detailed database on both costs and capacity constraints.

Conclusion

This paper considers recent methodological developments based on Double Machine Learning (DML) through the lens of a standard program evaluation under unconfoundedness. DML based methods provide a convenient toolbox for a comprehensive program evaluation, as different parameters of interest can be estimated using the same framework and a combination of standard statistical software. The application to an Active Labor Market Policy evaluation shows that the methods also produce plausible results in practice. The only exception is the DR-learner, which required a modification before producing stable results for all individualized treatment effects. However, several conceptual and implementational issues remain open for investigation and refinement.

In general, we know little about how to choose the estimator for the nuisance parameters. The pool of potential machine learning algorithms and their combinations is large, and little is known, e.g., about the trade-off between high prediction performance and computation time in the causal setting. Clear recommendations for the implementation of cross-fitting are also missing. Another open question is how to deal with common support in general and for each estimand specifically. The literature on trimming rules is well developed for propensity score based methods estimating average effects. However, we are not only interested in average effects, and the propensity score is not the only nuisance parameter of DML.
It remains an open question whether the established trimming methods are also sensible in settings where common support becomes an issue.

The estimators for flexible heterogeneous treatment effects provide interesting new tools. However, it is currently not clear to what extent we can actually explore heterogeneity or to what extent we need to pre-define the heterogeneity of interest. The possibility to summarize pre-defined heterogeneity of interest using OLS or kernel regressions provides clearly valuable and easy-to-use options in applications. The instability of methods that aim for individualized heterogeneous effects shows that they should be used with caution, and more research is required to investigate whether adjustments like the proposed NDR-learner are useful beyond the application of this paper.

The estimation of optimal treatment assignment rules is mostly unexplored in practice and raises many interesting issues in applications regarding inference, the implementation of different constraints, more flexible rules than decision trees, and the choice of variables that could or should enter the set of policy variables, all of which could be explored in future research.

The investigation of these DML specific questions, but also the comparison with other, more specialized causal machine learning methods for each estimand, provides another interesting direction for future research. Such evidence would help to understand and guide which choices are critical in applications similar to the one in this paper.

References
Abadie, A., & Cattaneo, M. D. (2018). Econometric methods for program evaluation.
Annual Review of Economics, 465–503.
Athey, S., & Imbens, G. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 685–725.
Athey, S., & Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, (27), 7353–7360.
Athey, S., & Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, (2), 3–32.
Athey, S., Imbens, G. W., & Wager, S. (2018). Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (4), 597–632.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. Annals of Statistics, (2), 1148–1178.
Athey, S., & Wager, S. (2017). Efficient policy learning.
Retrieved from https://arxiv.org/abs/1606.02647
Avagyan, V., & Vansteelandt, S. (2017).
Honest data-adaptive inference for the average treatment effect under model misspecification using penalised bias-reduced double-robust estimation.
Retrieved from http://arxiv.org/abs/1708.03787
Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., & Weinstein, J. (2018). Improving refugee integration through data-driven algorithmic assignment.
Science, (6373), 325–329.
Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, (2), 521–547.
Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, (1), 233–298.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, (2), 608–650.
Bertrand, M., Crépon, B., Marguerie, A., & Premand, P. (2017). Contemporaneous and post-program impacts of a public works program: Evidence from Côte d'Ivoire. World Bank Working Paper.
Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.
Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear smoothers and additive models. The Annals of Statistics, (2), 453–510.
Busso, M., DiNardo, J., & McCrary, J. (2014). New evidence on the finite sample properties of propensity score reweighting and matching estimators. Review of Economics and Statistics, (5), 885–897.
Card, D., Kluve, J., & Weber, A. (2018). What works? A meta analysis of recent active labor market program evaluations. Journal of the European Economic Association, (3), 894–931.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, (1), C1–C68.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2017). Generic machine learning inference on heterogenous treatment effects in randomized experiments.
Retrieved from http://arxiv.org/abs/1712.04802
Chernozhukov, V., Fernandez-Val, I., & Luo, Y. (2018). The sorted effects method: Discovering heterogeneous effects beyond their averages.
Econometrica, (6), 1911–1938.
Cockx, B., Lechner, M., & Bollens, J. (2020). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. CEPR Discussion Paper No. DP14270.
Colangelo, K., & Lee, Y.-Y. (2019). Double debiased machine learning nonparametric inference with continuous treatments. cemmap working paper CWP72/19.
Crépon, B., & van den Berg, G. J. (2016). Active labor market policies. Annual Review of Economics, 521–546.
Davis, J. M., & Heller, S. B. (2017). Using causal forests to predict treatment heterogeneity: An application to summer jobs. American Economic Review, (5), 546–550.
Dudik, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning.
Retrieved from http://arxiv.org/abs/1103.4601
Fan, Q., Hsu, Y.-C., Lieli, R. P., & Zhang, Y. (2019).
Estimation of conditional average treatment effects with high-dimensional data.
Retrieved from http://arxiv.org/abs/1908.02399
Farbmacher, H., Heinrich, K., & Spindler, M. (2019).
Heterogeneous Effects of Poverty on Cognition.
MEA Discussion Paper No. 06-2019.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, (1), 1–23.
Farrell, M. H., Liang, T., & Misra, S. (2018). Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands.
Retrieved from http://arxiv.org/abs/1809.09953
Foster, D. J., & Syrgkanis, V. (2019).
Orthogonal statistical learning.
Retrieved from http://arxiv.org/abs/1901.09036
Gerfin, M., & Lechner, M. (2002). A microeconometric evaluation of the active labour market policy in Switzerland. Economic Journal, (482), 854–893.
Glynn, A. N., & Quinn, K. M. (2009). An introduction to the augmented inverse propensity weighted estimator. Political Analysis, (1), 36–56.
Gulyas, A., & Pytka, K. (2019). Understanding the sources of earnings losses after job displacement: A machine-learning approach. Discussion Paper Series – CRC TR 224 No. 131.
Hájek, J. (1971). Comment on "An essay on the logical foundations of survey sampling, part one". In V. P. Godambe & D. A. Sprott (Eds.), Foundations of statistical inference (p. 236). Toronto: Holt, Rinehart and Winston.
Hayfield, T., & Racine, J. S. (2008). Nonparametric econometrics: The np package. Journal of Statistical Software, (5).
Hirano, K., & Porter, J. R. (2009). Asymptotics for statistical treatment rules. Econometrica, (5), 1683–1701.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, (396), 945–960.
Huber, M., Lechner, M., & Mellace, G. (2017). Why do tougher caseworkers increase employment? The role of program assignment as a causal mechanism. Review of Economics and Statistics, (1), 180–183.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, (3), 706–710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, (1), 4–29.
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Isphording, I. E. (2014).
Language and labor market success (No. 8572). IZA Discussion Papers.
Jacob, D., Härdle, W. K., & Lessmann, S. (2019). Group Average Treatment Effects for Observational Studies.
Retrieved from http://arxiv.org/abs/1911.02688
Kallus, N. (2018). Balanced policy evaluation and learning. In
Advances in neural information processing systems (pp. 8895–8906).
Kallus, N., Mao, X., & Uehara, M. (2019). Localized Debiased Machine Learning: Efficient Estimation of Quantile Treatment Effects, Conditional Value at Risk, and Beyond.
Retrieved from http://arxiv.org/abs/1912.12945
Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparisonof alternative strategies for estimating a population mean from incomplete data.
Statistical Science , (4), 523–539.Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects.
Retrieved from http://arxiv.org/abs/2004.14497
Kennedy, E. H., Ma, Z., McHugh, M. D., & Small, D. S. (2017). Non-parametric methodsfor doubly robust estimation of continuous treatment effects.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , , 1229–1245.Kitagawa, T., & Tetenov, A. (2018). Who should be treated? Empirical welfare maxi-mization methods for treatment choice. Econometrica , (2), 591–616.Knaus, M. C. (2018). A double machine learning approach to estimate the effects of musicalpractice on student’s skills.
Retrieved from https://arxiv.org/abs/1805.10300
Journal of HumanResources , , published ahead of print 26 March 2020. doi: 10.3368/jhr.57.2.0718-9615R1Knaus, M. C., Lechner, M., & Strittmatter, A. (2020b). Machine Learning Estimation ofHeterogeneous Causal Effects: Empirical Monte Carlo Evidence. The EconometricsJournal , utaa014 , published ahead of print 06 June 2020. doi: 10.1093/ectj/utaa014Knittel, C. R. (2019). Using machine learning to target treatment: The case of householdenergy use.
NBER Working Paper No. 26531.Kozbur, D. (2020). Analysis of testing-based forward model selection.
Econometrica , (5), 2147–2173.Kreif, N., & DiazOrdaz, K. (2019). Machine learning in policy evaluation: new tools forcausal inference.
Retrieved from http://arxiv.org/abs/1903.00402
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimatingheterogeneous treatment effects using machine learning.
Proceedings of the NationalAcademy of Sciences , (10), 4156–4165.Lalive, R., van Ours, J., & Zweimüller, J. (2008). The impact of active labor marketprograms on the duration of unemployment. Economic Journal , (525), 235–257.Lechner, M. (1999). Earnings and employment effects of continuous gff-the-job training ineast germany after unification. Journal of Business & Economic Statistics , (1),74–90.Lechner, M. (2001). Identification and estimation of causal effects of multiple treatmentsunder the conditional independence assumption. In M. Lechner & E. Pfeiffer (Eds.), Econometric evaluation of labour market policies (pp. 43–58). Heidelberg: Physica.Lechner, M. (2018).
Modified causal forests for estimating heterogeneous causal effects.
Retrieved from https://arxiv.org/abs/1812.09487
Lechner, M., & Smith, J. (2007). What is the value added by caseworkers?
LabourEconomics , (2), 135–151.Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensityscore in estimation of causal treatment effects: A comparative study. Statistics in edicine , (19), 2937–2960.Luo, Y., & Spindler, M. (2016). High-dimensional L2-boosting: Rate of Convergence.
Retrieved from http://arxiv.org/abs/1602.08927
Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations.
Econo-metrica , (4), 1221–1246.Ning, Y., Peng, S., & Imai, K. (2018). Robust estimation of causal effects viahigh-dimensional covariate balancing propensity score.
Retrieved from http://arxiv.org/abs/1812.08683
Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal random forest for causal infer-ence. , ,8655–8696.Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficientswhen some regressors are not always observed. Journal of the American StatisticalAssociation , (427), 846–866.Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1995). Analysis of semiparametric regressionmodels for repeated outcomes in the presence of missing data. Journal of theAmerican Statistical Association , (429), 106–121.Robins, J. M., Sued, M., Lei-Gomez, Q., & Rotnitzky, A. (2007). Comment: Performanceof double-robust estimators when "inverse probability" weights are highly variable. Statistical Science , (4), 544–559.Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonran-domized studies. Journal of Educational Psychology , (5), 688–701.Semenova, V., & Chernozhukov, V. (2020). Debiased machine learning of conditionalaverage treatment effects and other causal functions. The Econometrics Journal , utaa027 , published ahead of print 29 August 2020. doi: https://doi.org/10.1093/ectj/utaa027Smucler, E., Rotnitzky, A., & Robins, J. M. (2019). A unifying approach for doubly-robustL1 regularized estimation of causal contrasts.
Retrieved from http://arxiv.org/abs/1904.03737
Stoye, J. (2009). Minimax regret treatment choice with finite samples.
Journal of conometrics , (1), 70–81.Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validityof experiments. Journal of Econometrics , (1), 138–156.Strittmatter, A. (2018). What is the value added by using causal machine learningmethods in a welfare experiment evaluation?
Retrieved from http://arxiv.org/abs/1812.06533
Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2020). policytree: Policylearning via doubly robust empirical welfare maximization over trees.
Journal ofOpen Source Software , (50), 2232.Syrgkanis, V., & Zampetakis, M. (2020). Estimation and inference with trees and forestsin high dimensions.
Retrieved from http://arxiv.org/abs/2007.03210
Tan, Z. (2018).
Model-assisted inference for treatment effects using regularized calibratedestimation with high-dimensional data.
Retrieved from http://arxiv.org/abs/1801.09817
Tian, L., Alizadeh, A. A., Gentles, A. J., & Tibshirani, R. (2014). A simple methodfor estimating interactions between a treatment and a large number of covariates.
Journal of the American Statistical Association , (508), 1517–1532.van der Laan, M. J., & Rubin, D. (2006). Targeted maximum likelihood learning. International Journal of Biostatistics , (1).Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effectsusing random forests. Journal of the American Statistical Association , (523),1228–1242.Wager, S., & Walther, G. (2015). Adaptive concentration of regression trees, with applicationto random forests.
Retrieved from http://arxiv.org/abs/1503.06388
Wunsch, C. (2016). How to minimize lock-in effects of programs for unemployed workers.
IZA World of Labor .Zhou, Z., Athey, S., & Wager, S. (2018).
Offline multi-action policy learning: Generalizationand optimization.
Retrieved from http://arxiv.org/abs/1810.04778
Zimmert, M., & Lechner, M. (2019).
Nonparametric estimation of causal heterogeneityunder high-dimensional confounding.
Retrieved from http://arxiv.org/abs/1908 ppendices A Doubly robust identification
To revisit identification and the double robustness of Equation 3 under Assumption 1, rewrite the conditional average potential outcome in the following way, where $\mu_w(x) = E[Y_i(w) \mid X_i = x]$, $\mu(w, x) = E[Y_i \mid W_i = w, X_i = x]$ and $e_w(x) = E[D_i(w) \mid X_i = x] = P[W_i = w \mid X_i = x]$:
\begin{align*}
\mu_w(x) &= E\left[\mu(w, x) + \frac{D_i(w)(Y_i - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E\left[Y_i(w) - Y_i(w) + \mu(w, x) + \frac{D_i(w)(Y_i - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E\left[Y_i(w) - Y_i(w) + \mu(w, x) + \frac{D_i(w)(Y_i(w) - \mu(w, x))}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= E[Y_i(w) \mid X_i = x] + E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right] \\
&= \mu_w(x) + E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right] \tag{9}
\end{align*}
The conditional average potential outcome is thus identified if the second part of Equation 9 equals zero. This happens under three scenarios:

1. Correct propensity score and correct outcome regression:
\begin{align*}
E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - \mu(w, x)) \mid X_i = x]\, E\left[\frac{D_i(w) - e_w(x)}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - \mu(w, x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - e_w(x)}{e_w(x)}\right) \\
&= \left(\mu_w(x) - \mu(w, x)\right)\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right) \\
&= \underbrace{\left(\mu_w(x) - \mu_w(x)\right)}_{=0}\,\underbrace{\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right)}_{=0} = 0
\end{align*}

2. Correct propensity score but, instead of the correct outcome regression $\mu(w, x)$, some function $g(x)$:
\begin{align*}
E\left[(Y_i(w) - g(x))\left(\frac{D_i(w) - e_w(x)}{e_w(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - g(x)) \mid X_i = x]\, E\left[\frac{D_i(w) - e_w(x)}{e_w(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - g(x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - e_w(x)}{e_w(x)}\right) \\
&= \left(\mu_w(x) - g(x)\right)\underbrace{\left(\frac{e_w(x) - e_w(x)}{e_w(x)}\right)}_{=0} = 0
\end{align*}

3. Correct outcome regression but, instead of the correct propensity score $e_w(x)$, some function $h(x)$:
\begin{align*}
E\left[(Y_i(w) - \mu(w, x))\left(\frac{D_i(w) - h(x)}{h(x)}\right) \,\middle|\, X_i = x\right]
&= E[(Y_i(w) - \mu(w, x)) \mid X_i = x]\, E\left[\frac{D_i(w) - h(x)}{h(x)} \,\middle|\, X_i = x\right] \\
&= \left(E[Y_i(w) \mid X_i = x] - \mu(w, x)\right)\left(\frac{E[D_i(w) \mid X_i = x] - h(x)}{h(x)}\right) \\
&= \left(\mu_w(x) - \mu(w, x)\right)\left(\frac{e_w(x) - h(x)}{h(x)}\right) \\
&= \underbrace{\left(\mu_w(x) - \mu_w(x)\right)}_{=0}\left(\frac{e_w(x) - h(x)}{h(x)}\right) = 0
\end{align*}

B DR- and NDR-learner
This appendix describes the algorithms that are applied to estimate out-of-sample IATEs using the DR- and NDR-learner. It mostly follows Algorithm 1 of Kennedy (2020) and adapts it to the setting in which we are interested: estimating IATEs for all observations without using them in the estimation step.
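Both algorithms below build on the same AIPW pseudo-outcome. As a rough, self-contained illustration of the mechanics (cross-fitted nuisance models, pseudo-outcome construction, second-stage regression), consider the following sketch for a binary comparison of treatment 1 vs. 0. It uses simple linear nuisance models and a two-fold split instead of the four-sample scheme of the algorithms; the function names and the simulated data are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def lin_fit(X, y):
    """Least-squares fit of y on (1, X); returns the coefficient vector."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


def lin_pred(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta


def dr_pseudo_outcome(X, W, Y, train, test):
    """AIPW pseudo-outcome on `test`, nuisances fit on `train` (linear sketch)."""
    # Propensity score: linear probability model, clipped away from 0 and 1.
    e_hat = np.clip(lin_pred(lin_fit(X[train], W[train]), X[test]), 0.01, 0.99)
    # Outcome regressions fit separately within each treatment arm.
    mu1 = lin_pred(lin_fit(X[train][W[train] == 1], Y[train][W[train] == 1]), X[test])
    mu0 = lin_pred(lin_fit(X[train][W[train] == 0], Y[train][W[train] == 0]), X[test])
    Wt, Yt = W[test], Y[test]
    return (mu1 - mu0
            + Wt * (Yt - mu1) / e_hat
            - (1 - Wt) * (Yt - mu0) / (1 - e_hat))


# Simulated data with IATE tau(x) = 1 + x, hence ATE = 1.5 for x ~ U(0, 1).
n = 20_000
X = rng.uniform(size=n)
e = 0.3 + 0.4 * X                      # true propensity score
W = rng.binomial(1, e)
Y = 2 * X + (1 + X) * W + rng.normal(scale=0.5, size=n)

# Two-fold cross-fitting: pseudo-outcomes are always predicted out-of-fold.
idx = rng.permutation(n)
folds = (idx[: n // 2], idx[n // 2:])
delta = np.empty(n)
for train, test in ((folds[0], folds[1]), (folds[1], folds[0])):
    delta[test] = dr_pseudo_outcome(X, W, Y, train, test)

ate = delta.mean()                     # averaging recovers the ATE estimate
beta = lin_fit(X, delta)               # second stage: regress Delta on X -> IATEs
iate_at = lambda x: beta[0] + beta[1] * x
```

Averaging the pseudo-outcomes recovers the DML average effect, while the second-stage regression yields (here linear) IATE predictions; the algorithms below replace the linear second stage by a flexible learner and add the four-sample rotation.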
Algorithm 1 (DR-learner)
Let $(S_1^N, S_2^N, S_3^N, S_4^N)$ denote four independent samples of $N$ observations of $O_i = (X_i, W_i, Y_i)$.

Step 1. Nuisance training: (a) Construct a model $\hat{e}_w(x)$ of the propensity score $e_w(x)$ using $S_1^N$. (b) Construct a model $(\hat{\mu}(w, x), \hat{\mu}(w', x))$ of the regression functions $(\mu(w, x), \mu(w', x))$ using $S_2^N$.

Step 2. Pseudo-outcome regression: Construct the pseudo-outcome for every observation $i$ in subsample $S_3^N$ using the models of step 1,
\[
\hat{\Delta}_{i,w,w'} = \hat{\mu}(w, X_i) - \hat{\mu}(w', X_i) + \frac{D_i(w)}{\hat{e}_w(X_i)} \tilde{Y}_i(w, X_i) - \frac{D_i(w')}{\hat{e}_{w'}(X_i)} \tilde{Y}_i(w', X_i),
\]
regress it on the covariates $X_i$ in $S_3^N$, and use the model to predict the IATEs in $S_4^N$, $\hat{\tau}_{1,w,w'}(x)$.

Step 3. Cross-fitting: Repeat steps 1–2 twice, first using $S_2^N$ for the propensity score, $S_3^N$ for the outcome regression and $S_1^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{2,w,w'}(x)$, and then using $S_3^N$ for the propensity score, $S_1^N$ for the outcome regression and $S_2^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{3,w,w'}(x)$.

Step 4. Prediction: Predict the IATEs in $S_4^N$ as the average of the three predictions, $\hat{\tau}^{drl}_{w,w'}(x) = \frac{1}{3}\hat{\tau}_{1,w,w'}(x) + \frac{1}{3}\hat{\tau}_{2,w,w'}(x) + \frac{1}{3}\hat{\tau}_{3,w,w'}(x)$.

Step 5. Iteration: Repeat steps 1–4 three times, first with $S_2^N$, $S_3^N$ and $S_4^N$ to predict the IATEs for $S_1^N$, second with $S_3^N$, $S_4^N$ and $S_1^N$ to predict the IATEs for $S_2^N$, and finally with $S_4^N$, $S_1^N$ and $S_2^N$ to predict the IATEs for $S_3^N$.

Algorithm 2 (NDR-learner)
Let $(S_1^N, S_2^N, S_3^N, S_4^N)$ denote four independent samples of $N$ observations of $O_i = (X_i, W_i, Y_i)$.

Step 1. Nuisance training: (a) Construct a model $\hat{e}_w(x)$ of the propensity score $e_w(x)$ using $S_1^N$. (b) Construct a model $(\hat{\mu}(w, x), \hat{\mu}(w', x))$ of the regression functions $(\mu(w, x), \mu(w', x))$ using $S_2^N$.

Step 2a. Pseudo-outcome regression: Construct the pseudo-outcome for every observation $i$ in subsample $S_3^N$ using the models of step 1,
\[
\hat{\Delta}_{i,w,w'} = \hat{\mu}(w, X_i) - \hat{\mu}(w', X_i) + \frac{D_i(w)}{\hat{e}_w(X_i)} \tilde{Y}_i(w, X_i) - \frac{D_i(w')}{\hat{e}_{w'}(X_i)} \tilde{Y}_i(w', X_i),
\]
regress it on the covariates $X_i$ in $S_3^N$, and use the model to predict the IATEs in $S_4^N$.

Step 2b. Normalization: For every observation $j$ in $S_4^N$, (i) extract the smoothing weights $\alpha_i(X_j)$ underlying its prediction and (ii) use them to calculate the normalized DR-learner given in Equation 8, where the sum runs over the observations in $S_3^N$, to obtain $\hat{\tau}_{1,w,w'}(X_j)$.

Step 3. Cross-fitting: Repeat steps 1–2 twice, first using $S_2^N$ for the propensity score, $S_3^N$ for the outcome regression and $S_1^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{2,w,w'}(x)$, and then using $S_3^N$ for the propensity score, $S_1^N$ for the outcome regression and $S_2^N$ as subsample to obtain IATE predictions in $S_4^N$, $\hat{\tau}_{3,w,w'}(x)$.

Step 4. Prediction: Predict the IATEs in $S_4^N$ as the average of the three predictions, $\hat{\tau}^{ndr}_{w,w'}(x) = \frac{1}{3}\hat{\tau}_{1,w,w'}(x) + \frac{1}{3}\hat{\tau}_{2,w,w'}(x) + \frac{1}{3}\hat{\tau}_{3,w,w'}(x)$.

Step 5. Iteration: Repeat steps 1–4 three times, first with $S_2^N$, $S_3^N$ and $S_4^N$ to predict the IATEs for $S_1^N$, second with $S_3^N$, $S_4^N$ and $S_1^N$ to predict the IATEs for $S_2^N$, and finally with $S_4^N$, $S_1^N$ and $S_2^N$ to predict the IATEs for $S_3^N$.

C Results
C.1 Descriptives
Table C.1 provides the means of all control variables by program participation. It documents that measures of past labor market success, such as past income, are especially associated with program participation.

Table C.1: Means of control variables by program
No JS Voc Comp Lang
(1) (2) (3) (4) (5)
Age 36.6 37.3 37.5 39.1 35.3
Mother tongue in canton's language 0.10 0.12 0.11 0.11 0.04
Lives in big city 0.19 0.19 0.21 0.11 0.23
Lives in medium city 0.12 0.13 0.12 0.15 0.15
Lives in no city 0.68 0.68 0.67 0.73 0.63
Caseworker age 44.1 44.1 44.8 44.6 44.6
Caseworker cooperative 0.48 0.50 0.41 0.42 0.45
Caseworker education: above vocational training 0.45 0.45 0.44 0.48 0.48
Caseworker education: tertiary track 0.19 0.21 0.17 0.16 0.21
Caseworker female 0.43 0.47 0.39 0.44 0.47
Missing caseworker characteristics 0.05 0.05 0.04 0.05 0.05
Caseworker has own unemployment experience 0.62 0.63 0.64 0.61 0.63
Caseworker tenure 5.48 5.44 5.73 5.83 5.61
Caseworker education: vocational degree 0.26 0.27 0.22 0.25 0.22
Fraction of months employed last 2 years 0.81 0.84 0.83 0.84 0.72
Number of employment spells last 5 years 1.21 0.97 0.93 0.86 0.78
Employability 1.93 1.98 1.93 1.97 1.85
Female 0.44 0.44 0.33 0.60 0.55
Foreigner with temporary permit 0.13 0.11 0.12 0.04 0.44
Foreigner with permanent permit 0.23 0.22 0.18 0.17 0.23
Cantonal GDP p.c. 0.52 0.53 0.51 0.53 0.54
Married 0.47 0.46 0.48 0.45 0.72
Mother tongue other than German, French, Italian 0.33 0.29 0.31 0.18 0.64
Past income 42527.9 46693.1 48653.8 43212.8 37300.5
Previous job: manager 0.08 0.08 0.10 0.09 0.07
Missing sector 0.18 0.15 0.15 0.16 0.29
Previous job in primary sector 0.09 0.06 0.09 0.05 0.05
Previous job in secondary sector 0.12 0.14 0.15 0.13 0.12
Previous job in tertiary sector 0.61 0.65 0.61 0.67 0.54
Previous job: self-employed 0.01 0.00 0.00 0.00 0.00
Previous job: skilled worker 0.60 0.65 0.65 0.75 0.43
Previous job: unskilled worker 0.29 0.24 0.22 0.15 0.48
Qualification: semiskilled 0.16 0.14 0.17 0.14 0.15
Qualification: some degree 0.58 0.62 0.63 0.72 0.38
Qualification: unskilled 0.23 0.20 0.17 0.12 0.40
Qualification: skilled without degree 0.03 0.03 0.02 0.02 0.07
Swiss citizen 0.63 0.67 0.70 0.79 0.34
Allocation of unemployed to caseworkers: by industry 0.60 0.67 0.58 0.51 0.64
Allocation of unemployed to caseworkers: by occupation 0.51 0.57 0.46 0.45 0.57
Allocation of unemployed to caseworkers: by age 0.04 0.04 0.04 0.06 0.05
Allocation of unemployed to caseworkers: by employability 0.09 0.07 0.10 0.08 0.06
Allocation of unemployed to caseworkers: by region 0.13 0.09 0.09 0.13 0.11
Allocation of unemployed to caseworkers: other 0.09 0.07 0.08 0.10 0.09
Number of unemployment spells last 2 years 0.57 0.39 0.52 0.37 0.43
Cantonal unemployment rate (in %) 3.52 3.59 3.41 3.36 3.63
Note:
Program specific means.

C.2 Nuisance parameters

Nuisance parameters are only a tool to remove confounding, but it is still informative to investigate which variables are most predictive of treatment probabilities and outcomes. This is less straightforward for flexible tools like random forests than for the well-known regression outputs of parametric models. We therefore conduct a classification analysis as proposed by Chernozhukov, Fernandez-Val, and Luo (2018). To this end, we split the predicted nuisance parameter distributions into quintiles and compare the covariate means of the observations falling into the fifth and the first quintile. For comparability, we normalize all covariates to have mean zero and variance one and order the variables by their largest absolute difference between the highest and lowest quintile.

Table C.2 shows that measures of citizenship, qualification and previous labor market success are important predictors of program selection. In line with intuition, citizenship in particular seems to drive a large part of the selection into language courses. Table C.3, which shows the classification analysis for the outcome predictions, also displays intuitive patterns. Again, measures of citizenship, qualification and previous labor market success seem predictive of future employment, with the suggested correlations pointing in the expected directions. For example, Swiss citizens, individuals with a degree and individuals with high past income are overrepresented in the upper quintile, while individuals with a non-Swiss mother tongue and no qualification are underrepresented in the upper quintile.

Finally, we investigate the propensity score distributions for all programs. Figure C.1 shows that propensity scores are quite variable. This indicates that selection into programs is not negligible. Further, Table C.4 shows that some of the propensity scores become quite small, the smallest one being 0.003 for a computer training participant. This is not surprising given that the unconditional participation probabilities for computer and vocational training are only about 0.015. However, a small propensity score per se is not an indicator of poor overlap. The observation with the smallest propensity score receives a weight of ∼ 1/0.003 = 333, which is only 0.5% of the total weights. Note that we could easily increase the smallest propensity score by randomly removing a large fraction of non-participants and participants of the job search program. This would discard valuable information and shows that a mere focus on the smallest propensity score can be misleading in cases with imbalanced treatment group sizes. More importantly, we observe overlap in the sense that all treatment groups contain individuals with similarly low propensity scores. Thus, overlap does not seem to be a major issue in our application, at least for the low-dimensional parameters of interest.

Table C.2: Classification analysis of propensity scores
No program Job search Vocational Computer Language
(1) (2) (3) (4) (5)
Foreigner with temporary permit -0.24 -0.33 -0.32 -1.24 1.95
Swiss citizen 0.09 0.21 0.43 1.60 -1.86
Mother tongue other than German, French, Italian 0.11 -0.43 -0.33 -1.58 1.76
Previous job: unskilled worker 0.32 -0.44 -0.60 -1.52 0.99
Past income -0.71 0.77 1.40 0.28 -0.06
Previous job: skilled worker -0.21 0.32 0.30 1.32 -0.90
Qualification: some degree -0.19 0.31 0.51 1.26 -0.93
Qualification: unskilled 0.05 -0.16 -0.55 -1.15 0.86
Married -0.23 -0.02 -0.17 -0.60 1.09
Female -0.07 -0.06 -1.04 0.99 0.42
Cantonal unemployment rate (in %) -0.71 0.74 -0.91 -0.55 -0.13
Foreigner with permanent permit 0.09 0.03 -0.24 -0.82 0.55
Age -0.34 0.34 0.36 0.81 -0.20
Cantonal GDP p.c. -0.64 0.67 -0.75 -0.17 -0.11
Number of employment spells last 5 years 0.74 -0.54 -0.39 -0.46 -0.34
Allocation of unemployed to caseworkers: by occupation -0.47 0.47 -0.64 -0.20 0.08
Allocation of unemployed to caseworkers: by region 0.53 -0.53 0.02 0.05 0.02
Fraction of months employed last 2 years -0.27 0.47 0.52 0.41 -0.46
Allocation of unemployed to caseworkers: by industry -0.43 0.44 -0.48 -0.52 0.01
Previous job: manager -0.20 0.18 0.51 0.18 0.10
Employability -0.44 0.49 0.14 0.34 -0.24
Previous job in tertiary sector -0.20 0.27 0.01 0.45 -0.31
Missing sector 0.16 -0.29 -0.41 -0.34 0.43
Lives in big city 0.08 -0.08 -0.02 -0.43 0.10
Caseworker cooperative -0.06 0.10 -0.43 -0.22 -0.02
Caseworker female -0.29 0.27 -0.41 0.15 -0.02
Number of unemployment spells last 2 years 0.39 -0.33 -0.04 -0.39 0.02
Qualification: skilled without degree -0.08 -0.02 -0.11 -0.27 0.36
Lives in no city -0.01 0.04 -0.02 0.33 -0.12
Previous job in primary sector 0.30 -0.26 0.30 -0.27 -0.03
Qualification: semiskilled 0.24 -0.23 -0.01 -0.24 0.10
Caseworker tenure -0.03 0.02 0.22 0.11 0.05
Previous job in secondary sector -0.14 0.16 0.21 -0.05 -0.02
Allocation of unemployed to caseworkers: by employability 0.19 -0.19 0.11 -0.00 -0.01
Caseworker education: tertiary track -0.19 0.19 -0.11 -0.17 0.00
Caseworker age -0.06 0.04 0.12 0.08 0.17
Mother tongue in canton's language -0.03 0.11 -0.05 0.02 0.16
Caseworker has own unemployment experience -0.14 0.16 0.09 -0.00 0.01
Caseworker education: vocational degree -0.11 0.14 -0.15 0.08 -0.15
Allocation of unemployed to caseworkers: other 0.06 -0.04 -0.14 0.01 -0.07
Caseworker education: above vocational training 0.12 -0.13 0.04 0.12 0.02
Allocation of unemployed to caseworkers: by age -0.02 0.01 -0.08 0.07 -0.00
Lives in medium city -0.08 0.03 0.06 0.05 0.06
Previous job: self-employed 0.01 0.01 0.04 0.03 -0.07
Missing caseworker characteristics 0.07 -0.06 0.04 -0.02 0.01
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective propensity score distribution. Variables are ordered according to the largest absolute difference.
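The classification analysis underlying these tables is mechanical: normalize the covariates, compare their means in the fifth and first quintile of a predicted nuisance parameter, and sort by absolute difference. A minimal sketch with simulated data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)


def classification_analysis(X, score, names):
    """Difference in normalized covariate means, 5th vs. 1st quintile of `score`.

    Returns (name, difference) pairs sorted by the largest absolute difference,
    mirroring the layout of the tables in this appendix.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize: mean 0, variance 1
    q20, q80 = np.quantile(score, [0.2, 0.8])
    diff = Z[score >= q80].mean(axis=0) - Z[score <= q20].mean(axis=0)
    order = np.argsort(-np.abs(diff))
    return [(names[j], float(diff[j])) for j in order]


# Toy example: the first covariate strongly drives the score, the second not at all.
n = 10_000
X = rng.normal(size=(n, 2))
score = 1 / (1 + np.exp(-2 * X[:, 0]))            # stand-in "propensity score"
ranking = classification_analysis(X, score, ["driver", "noise"])
```

The driving covariate shows a large positive quintile difference, while the irrelevant one stays close to zero, which is exactly the pattern used to read the tables above.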
Table C.3: Classification analysis of outcome predictions

No program Job search Vocational Computer Language
(1) (2) (3) (4) (5)
Mother tongue other than German, French, Italian -1.44 -1.66 -1.15 -2.05 -1.88
Swiss citizen 1.36 1.58 1.13 1.97 1.84
Qualification: some degree 1.79 1.97 1.61 1.67 1.87
Previous job: unskilled worker -1.72 -1.75 -1.57 -1.67 -1.91
Past income 1.50 1.39 1.18 0.85 1.72
Qualification: unskilled -1.42 -1.57 -1.56 -1.41 -1.54
Previous job: skilled worker 1.23 1.26 1.21 1.37 1.40
Foreigner with permanent permit -0.87 -1.10 -0.71 -1.36 -1.17
Number of unemployment spells last 2 years -0.92 -0.80 -1.22 -0.59 -0.54
Married -1.00 -1.14 -0.60 -1.16 -1.03
Foreigner with temporary permit -0.83 -0.87 -0.71 -1.11 -1.16
Fraction of months employed last 2 years 0.99 0.79 0.93 0.65 0.87
Employability 0.94 0.83 0.45 0.50 0.61
Cantonal unemployment rate (in %) -0.10 -0.05 -0.81 -0.04 -0.04
Cantonal GDP p.c. -0.04 0.01 -0.79 0.04 0.06
Age -0.72 -0.77 -0.24 -0.32 -0.26
Lives in big city -0.24 -0.23 -0.73 -0.37 -0.28
Missing sector -0.50 -0.48 -0.63 -0.53 -0.71
Number of employment spells last 5 years -0.53 -0.42 -0.71 -0.46 -0.40
Qualification: semiskilled -0.62 -0.67 -0.27 -0.47 -0.59
Lives in no city 0.26 0.22 0.66 0.43 0.32
Previous job: manager 0.55 0.53 0.44 0.30 0.66
Female -0.42 -0.33 -0.48 0.30 -0.59
Previous job in tertiary sector 0.32 0.37 0.17 0.54 0.51
Mother tongue in canton's language -0.22 -0.24 -0.18 -0.44 -0.32
Qualification: skilled without degree -0.35 -0.38 -0.22 -0.34 -0.35
Allocation of unemployed to caseworkers: by occupation 0.17 0.21 0.30 0.38 0.24
Previous job in secondary sector 0.07 0.05 0.29 -0.04 0.08
Caseworker age 0.01 0.03 0.27 -0.13 0.06
Caseworker female -0.00 0.02 -0.18 0.24 -0.02
Previous job in primary sector 0.06 -0.04 0.22 -0.17 -0.01
Caseworker tenure -0.04 -0.06 -0.10 -0.21 -0.06
Lives in medium city -0.09 -0.04 -0.06 -0.17 -0.12
Allocation of unemployed to caseworkers: by employability 0.05 0.03 0.16 0.05 0.04
Caseworker education: vocational degree 0.12 0.09 0.15 0.08 0.10
Caseworker education: above vocational training 0.02 0.04 0.10 0.13 0.05
Allocation of unemployed to caseworkers: by industry 0.12 0.12 0.10 0.11 0.09
Allocation of unemployed to caseworkers: by region 0.05 -0.03 0.11 -0.03 -0.03
Missing caseworker characteristics -0.04 -0.05 -0.10 0.02 -0.05
Caseworker cooperative -0.03 -0.04 -0.08 0.03 -0.04
Allocation of unemployed to caseworkers: by age -0.01 -0.00 0.03 0.07 0.03
Caseworker education: tertiary track 0.05 0.04 0.01 -0.06 0.00
Caseworker has own unemployment experience 0.03 0.05 0.00 0.01 0.02
Previous job: self-employed -0.03 -0.01 -0.03 -0.02 0.02
Allocation of unemployed to caseworkers: other -0.03 -0.03 0.00 -0.01 0.01
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective outcome prediction distribution. Variables are ordered according to the largest absolute difference.
Table C.4: Summary statistics of propensity score distributions
No program Job search Vocational Computer Language
Mean 0.763 0.185 0.014 0.015 0.024
SD 0.093 0.094 0.006 0.007 0.030
Minimum 0.328 0.028 0.005 0.003 0.005
Q1 0.435 0.044 0.007 0.004 0.007
Q25 0.727 0.121 0.009 0.009 0.010
Q50 0.778 0.172 0.012 0.013 0.015
Q75 0.822 0.223 0.017 0.018 0.025
Q99 0.910 0.517 0.031 0.038 0.110
Maximum 0.935 0.619 0.061 0.098 0.487
Note:
The table provides summary statistics of the program specific propensity score distributions. The rows denoted by Q show the respective quantiles.
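The overlap diagnostics discussed in this appendix follow directly from the estimated propensity scores: the smallest score in a treatment group and the share of the group's total inverse probability weight that this single observation receives. A minimal sketch with illustrative numbers (not the application's estimates):

```python
import numpy as np


def min_ps_weight_share(ps_treated):
    """Smallest propensity score among participants and its IPW weight share.

    `ps_treated` holds the estimated propensity scores of the participants of
    one program; each participant receives weight 1 / e_w(x).
    """
    ps = np.asarray(ps_treated, dtype=float)
    w = 1.0 / ps
    return float(ps.min()), float(w.max() / w.sum())


# Hypothetical group: one participant with score 0.003 among 199 with score 0.03.
ps = np.array([0.003] + [0.03] * 199)
smallest, share = min_ps_weight_share(ps)
```

Even though the most extreme observation gets a weight of roughly 333, its share of the group's total weight stays modest, which is the point made in the text: the smallest propensity score alone is not informative about overlap when treatment groups are imbalanced.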
Figure C.1: Propensity score distributions

[Density plots of the estimated propensity scores; panels: (a) No program, (b) Job search, (c) Vocational, (d) Computer, (e) Language.]

C.3 Average treatment effects

Figure C.2 shows the APO estimates and Figure C.3 shows the effects of program participation on the employment probabilities over time. The latter documents that all programs show the well-known lock-in effect within the first months after program start (e.g. Wunsch, 2016). However, participants of the hard skill programs catch up and show a sustained increase in employment rates of up to 10 percentage points.

Figure C.2: Average potential outcomes

[Point estimates of the average potential outcomes of No program, Job search, Vocational, Computer and Language.]

Notes:
Average potential outcomes with 95% confidence intervals. Numeric results in Panel A of Table C.5.

Figure C.3: Effects of program participation on employment probabilities over time

[Monthly average treatment effects over the 31 months after program start; panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]

Notes:
Solid lines show the ATE, dashed lines the ATET of the respective programs compared to nonparticipation on the employment probability in the 31 months after program start. Grey area depicts the 95% confidence intervals.
Table C.5: Average effects

Estimate Standard error
(1) (2)

Panel A: APO
No program 14.80 0.06
Job search 13.78 0.12
Vocational 18.02 0.42
Computer 18.16 0.43
Language 17.36 0.37

Panel B: ATE
Job search - no program -1.02 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗

Panel C: ATET
Job search - no program -0.98 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗

Note: Table shows DML based point estimates and standard errors of average effects. ∗ p < ∗∗ p < ∗∗∗ p <

C.4 Kernel regression CATEs

Figure C.4 shows that the kernel regression CATEs detect no substantial heterogeneity across individuals of different age. Either the cross-validated bandwidth is very large, estimating essentially a constant effect for job search, vocational training and computer courses, or the bandwidth seems too small, leading to imprecise and erratic estimates around the mean effect for language programs.

Figure C.4: Effect heterogeneity regarding age
[Conditional average treatment effects by age (30–50); panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]

Dotted line indicates the point estimate of the respective average treatment effect. Grey area shows the 95%-confidence interval.

C.5 IATEs

Figure C.5 documents the extreme IATE predictions obtained using the full sample. Especially the DR-learner produces very extreme estimates for vocational training, ranging from -209 to 165. Also in this extreme case, the NDR-learner mitigates the problem substantially. However, Table C.6 documents that it still produces implausibly high values, ranging from -21 to 23. The out-of-sample prediction of IATEs is therefore preferred and discussed in the main text.

Figure C.5: Boxplot of IATEs estimated by DR- and NDR-learner

[Boxplots of the IATE distributions per program for DRL.oos, NDRL.oos, DRL.full and NDRL.full.]
Note:
The figure shows the distribution of IATEs for participating in the program labeled on the x-axis vs. non-participation, estimated by the DR-learner (DRL) and the NDR-learner (NDRL). The first two boxplots of a group are obtained using the out-of-sample (oos) procedure of Appendix B and the other two from the full sample. The dashed line indicates the possible range of the IATE of [-31, 31] to illustrate that several DR-learner estimated IATEs lie outside this bound.
Table C.6 and Figure C.6 provide a detailed comparison of the IATEs estimated by the DR- and NDR-learner. We see that the differences are mainly driven by a few outliers, as indicated by the much larger kurtosis of the DR-learner IATEs. However, most of the estimates are quite similar, as the correlations of at least 0.88 provided in the last row of Table C.6 and the scatter plots in Figure C.6 document.

Table C.6: Summary statistics of IATE distributions
Job search Vocational Computer Language
DRL NDRL DRL NDRL DRL NDRL DRL NDRL

Panel A: Out-of-sample
Mean -0.98 -1.00 3.22 3.17 3.55 3.47 2.35 2.36
SD 0.97 0.92 1.96 1.71 2.65 2.18 2.05 1.78
Minimum -8.91 -5.11 -14.54 -6.41 -58.73 -6.06 -14.80 -5.30
Q1 -3.16 -3.09 -1.88 -1.07 -3.65 -1.82 -2.55 -1.74
Q25 -1.62 -1.62 2.06 2.08 2.08 2.07 1.07 1.10
Q50 -1.02 -1.04 3.24 3.17 3.61 3.45 2.39 2.43
Q75 -0.39 -0.43 4.43 4.27 5.16 4.86 3.61 3.63
Q99 1.53 1.36 7.89 7.25 9.20 8.75 7.49 6.24
Maximum 4.80 5.82 19.39 11.39 29.85 14.16 22.42 8.76
Kurtosis 3.71 3.51 5.44 3.51 21.50 3.34 5.09 2.76
Correlation 0.99 0.93 0.87 0.93

Panel B: Full sample
Mean -1.02 -1.04 3.21 3.12 3.31 3.09 2.64 2.62
SD 1.42 1.31 8.42 3.67 2.32 2.15 1.45 1.42
Minimum -12.25 -9.47 -208.94 -20.75 -38.75 -8.82 -5.77 -3.65
Q1 -4.60 -4.24 -10.21 -6.59 -2.18 -2.29 -0.82 -0.68
Q25 -1.90 -1.88 0.99 0.90 1.92 1.72 1.65 1.61
Q50 -1.05 -1.06 3.32 3.18 3.35 3.14 2.72 2.72
Q75 -0.16 -0.21 5.65 5.48 4.74 4.50 3.66 3.67
Q99 2.55 2.23 16.16 11.49 8.49 8.06 5.77 5.51
Maximum 12.40 8.34
Note:
The table provides summary statistics for the distributions of IATEs estimated by the DR-learner (DRL) and NDR-learner (NDRL). The rows denoted by Q show the respective quantiles. Correlation is calculated between the DR-learner and the NDR-learner. Bold numbers indicate values that are outside the possible range.

Figure C.6: Joint and marginal distributions of IATEs

[Scatter plots of DR-learner vs. NDR-learner IATEs; panels: (a) Job search, (b) Vocational, (c) Computer, (d) Language.]
Notes:
Figures show the joint and marginal distributions of IATEs estimated by the DR-learner and the NDR-learner.
Table C.7: Classification analysis of IATEs

Job search Vocational Computer Language
(1) (2) (3) (4)
Previous job: unskilled worker 0.97 0.73 0.39 -1.38
Past income -1.33 -0.92 -1.13 1.03
Mother tongue other than German, French, Italian 0.65 0.71 0.05 -1.28
Qualification: some degree -0.85 -0.68 -0.44 1.28
Swiss citizen -0.61 -0.68 0.05 1.27
Qualification: unskilled 0.77 0.47 0.32 -1.16
Fraction of months employed last 2 years -1.02 -0.42 -0.44 0.35
Previous job: skilled worker -0.76 -0.47 -0.17 1.02
Foreigner with permanent permit 0.32 0.47 -0.16 -0.82
Female 0.54 0.09 0.79 -0.51
Foreigner with temporary permit 0.47 0.38 0.13 -0.78
Married 0.32 0.65 0.20 -0.76
Missing sector 0.67 0.07 0.17 -0.57
Cantonal GDP p.c. 0.26 -0.66 -0.02 0.25
Cantonal unemployment rate (in %) 0.34 -0.61 0.02 0.11
Previous job in tertiary sector -0.37 -0.31 0.04 0.60
Employability -0.50 -0.59 -0.57 0.26
Number of employment spells last 5 years 0.46 0.15 0.04 -0.06
Number of unemployment spells last 2 years 0.46 0.09 0.21 -0.11
Previous job: manager -0.33 -0.35 -0.31 0.44
Age 0.05 0.42 0.43 -0.00
Lives in big city 0.23 -0.38 -0.09 -0.10
Caseworker age 0.13 0.37 -0.01 0.05
Lives in no city -0.34 0.24 0.08 0.10
Qualification: semiskilled 0.20 0.34 0.20 -0.29
Allocation of unemployed to caseworkers: by region -0.30 0.08 -0.04 -0.13
Allocation of unemployed to caseworkers: by occupation 0.12 0.03 0.24 0.30
Previous job in primary sector -0.27 0.25 -0.20 -0.16
Caseworker female 0.05 -0.25 0.25 -0.05
Qualification: skilled without degree 0.13 0.09 0.04 -0.22
Caseworker education: above vocational training 0.03 0.13 0.11 0.20
Lives in medium city 0.20 0.12 -0.00 -0.02
Mother tongue in canton's language 0.05 0.09 -0.18 -0.11
Previous job in secondary sector -0.01 0.17 -0.09 -0.09
Allocation of unemployed to caseworkers: by employability -0.16 0.03 0.03 0.06
Allocation of unemployed to caseworkers: by industry 0.07 -0.14 -0.01 0.02
Caseworker education: tertiary track -0.02 -0.14 -0.13 -0.14
Caseworker education: vocational degree -0.14 0.03 -0.13 -0.02
Caseworker cooperative 0.03 -0.12 0.13 -0.07
Allocation of unemployed to caseworkers: by age -0.03 0.01 0.10 0.02
Missing caseworker characteristics 0.03 -0.10 0.09 -0.09
Caseworker tenure 0.06 0.05 -0.07 -0.09
Caseworker has own unemployment experience 0.05 -0.05 -0.06 0.06
Previous job: self-employed 0.05 0.02 0.03 0.04
Allocation of unemployed to caseworkers: other -0.03 0.02 -0.03 0.00
Note:
Table shows the differences in means of normalized covariates between the fifth and the first quintile of the respective estimated IATE distribution. Variables are ordered according to the largest absolute difference.

C.6 Optimal treatment assignment

Figure C.7: Optimal decision tree of depth three with five covariates
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

Figure C.8: Overlap of cross-validated policy rules with five covariates
[Histogram of the overlap for trees of depth 1, 2 and 3; x-axis: overlap, y-axis: fraction.]
Notes:
Figure shows the fraction of cross-validated policies that agree with the full sample policy.

Figure C.9: Optimal decision tree of depth three with 16 covariates
Notes:
Optimal assignment rules estimated following the procedure defined in Section 3.4.

Figure C.10: Overlap of cross-validated policy rules with 16 covariates
[Histogram of the overlap for trees of depth 1, 2 and 3; x-axis: overlap, y-axis: fraction.]
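The depth-limited trees in this section maximize the sum of doubly robust scores over all trees of a given depth (Zhou et al., 2018), as implemented in the policytree package. The core of the search can be illustrated for depth one by an exhaustive scan over covariates and split points; the following numpy sketch is a simplified stand-in for the actual implementation, with illustrative names and simulated scores:

```python
import numpy as np

rng = np.random.default_rng(2)


def depth1_policy_tree(X, scores):
    """Exhaustive depth-1 tree maximizing the sum of doubly robust scores.

    `scores[i, w]` is the DR score of observation i under treatment w.
    Returns (feature, threshold, action_left, action_right).
    """
    best_val, best_rule = -np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            # Best single action on each side of the candidate split.
            left_sums = scores[left].sum(axis=0)
            right_sums = scores[~left].sum(axis=0)
            val = left_sums.max() + right_sums.max()
            if val > best_val:
                best_val = val
                best_rule = (j, float(t),
                             int(left_sums.argmax()), int(right_sums.argmax()))
    return best_rule


# Toy data: treatment 1 pays off only for x0 > 0.5, treatment 0 otherwise.
n = 2_000
X = rng.uniform(size=(n, 2)).round(2)        # rounding keeps the split grid small
scores = np.zeros((n, 2))
scores[:, 1] = np.where(X[:, 0] > 0.5, 1.0, -1.0) + rng.normal(scale=0.3, size=n)

rule = depth1_policy_tree(X, scores)
```

The search recovers the split on the relevant covariate and assigns the welfare-maximizing action on each side; deeper trees apply the same principle recursively, which is what makes the exhaustive search expensive and motivates the depth limits used above.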