The Value Added of Machine Learning to Causal Inference: Evidence from Revisited Studies
Anna Baiardi† and Andrea A. Naghi‡

December 2020
Abstract:
A new and rapidly growing econometric literature is making advances in the problem of using machine learning methods for causal inference questions. Yet, the empirical economics literature has not started to fully exploit the strengths of these modern methods. We revisit influential empirical studies with causal machine learning methods and identify several advantages of using these techniques. We show that these advantages and their implications are empirically relevant and that the use of these methods can improve the credibility of causal analysis.
Keywords: machine learning, causal inference, average treatment effects, heterogeneous treatment effects.
J.E.L. Classification: C01, C21, D04

∗ Baiardi acknowledges support from EU Horizon 2020, Marie Skłodowska-Curie individual grant (No. 840319). Naghi acknowledges support from EU Horizon 2020, Marie Skłodowska-Curie individual grant (No. 797286). Financial support from the United Nations Sustainable Development Funds is also gratefully acknowledged. We thank participants at the Machine Learning for Economics Workshop (at Barcelona GSE Summer Forum 2019) and at the Netherlands Econometrics Study Group Meeting 2020 for very helpful comments. Nadja van't Hoff and Christian Wirths provided excellent research assistance.

† Department of Economics, Erasmus University and Tinbergen Institute. Email: [email protected].

‡ Department of Econometrics, Erasmus University and Tinbergen Institute. Email: [email protected].

1 Introduction
One of the key goals of empirical research in economics is to estimate the causal effect of a variable of interest on a targeted outcome. To avoid biases in the coefficients of interest due to omitted variables, particularly in observational studies, it is often desirable to include a large number of controls. However, even when the number of raw covariates is relatively small, the inclusion of technical controls (e.g. dummy variables for geographical location, time periods, etc.), interactions and transformations can lead to settings in which the number of covariates is large relative to the sample size.

Machine learning (ML) methods can potentially be useful in such settings. However, standard ML prediction models are aimed at fundamentally different problems than most of the empirical work in economics. ML methods are designed and optimized for predicting the outcome in a test sample. Thus, a model is selected by optimizing the goodness of fit on the held-out test set. In contrast, in empirical economic research, the goodness of fit of a model is oftentimes reduced when estimating a causal effect, and predictive accuracy is sacrificed in order to learn more deeply about a fundamental relationship that can guide policy decisions and counterfactual predictions (Athey and Imbens, 2019). These fundamental differences will eventually generate biased estimates if standard
ML techniques, designed for prediction, are used in the context of causal inference. Nevertheless, a new and rapidly growing econometric literature is making advances in the problem of using ML methods for causal inference questions (see, e.g., Chernozhukov et al., 2018a; Athey et al., 2018; Wager and Athey, 2018; Chernozhukov et al., 2018b). This literature brings in new insights and theoretical results that are novel for both the ML and the econometrics/statistics literature. Despite these advances, the empirical economics literature has not yet started to fully exploit the strengths of these new methods.

Note that by 'prediction' here, we do not mean 'forecasting'. Rather, we refer to a setting where we observe both the outcome and the features/covariates in a training sample and the aim is to predict the outcomes for each observation in an independent test sample, based on the actual values of the covariates in that test sample.

The main underlying reason is that high-dimensional regression adjustments such as lasso, ridge, elastic net, etc., shrink the estimated effects by construction, and ignoring this shrinkage will lead to biased treatment effect estimates.
We revisit influential studies published in leading journals: The Quarterly Journal of Economics, American Economic Journal: Macroeconomics, and American Economic Journal: Applied Economics.
We choose papers for which the full replication data set is available either on the journal's website or on the authors' website. For the ATE, we revisit three observational studies: the study of Djankov et al. (2010) on the effect of corporate taxes on investment and entrepreneurship, the analysis of Alesina et al. (2013) on the long-term effect of plough agriculture on gender norms, and the paper by Nunn and Trefler (2010) on the effect of skill-biased tariffs on long-term economic growth. For the HTE, we select one observational study and one randomized control trial: we extend the observational study by DellaVigna and Kaplan (2007), which investigates the effect of Fox News on the Republican vote share, and the analysis by Loyalka et al. (2019) on the effect of a teacher training randomized intervention on student performance. All these papers include careful econometric analyses of the main research question and mechanisms, which we do not aim to re-examine in full. We instead focus on analyzing the main questions.

Our findings show important differences in the ATE and HTE estimates compared to the traditional methods, both in terms of the size of the treatment effect estimates and in terms of statistical significance. From our results, we derive four main observations about the reasons why causal machine learning methods are relevant for causal analysis and add value relative to the traditional methods. These observations are supported by the theoretical econometrics literature on causal ML (see, for example, Athey et al., 2018; Chernozhukov et al., 2018b; Wager and Athey, 2018).

Firstly, causal ML methods are powerful tools for using data to recover complex interactions among variables and flexibly estimate the relationship between the outcome, the treatment indicator and covariates. This feature is key when drawing inference based on the assumption that the treatment is unconfounded, as is the case in most of the revisited studies, since this assumption is not testable.
As some covariates can be correlated with both the treatment variable and the outcome, failing to condition on all relevant confounders may lead to biased estimates of the treatment effect. For example, for the effect of corporate taxes on investment and entrepreneurship, the original analysis in Djankov et al. (2010) shows a negative and significant effect of corporate taxes on investment and entrepreneurship, but the authors show that these results do not survive when conditioning on all the potential controls at once. However, when implementing DML, we obtain larger estimates compared to Djankov et al. (2010), which are often statistically significant. Similarly, our DML results for the effect of plough cultivation on gender roles suggest a larger effect of the plough compared to the findings in Alesina et al. (2013), when we use the instrumental variable strategy employed in the original analysis. Furthermore, our analysis of the effect of skill-biased tariffs on growth suggests a smaller effect compared to Nunn and Trefler (2010), which is often not statistically significant. We thus argue that the DML estimates are more robust to potential nonlinear confounders.

It is important to note here that the idea of estimating treatment effects without making parametric assumptions about the way in which the covariates enter the equation has already been considered in the semiparametric econometrics literature (see the review paper of Imbens and Wooldridge, 2009, and Imbens and Rubin, 2015). However, in practice, these semiparametric kernel methods quickly break down if they have to deal with more than a few covariates.

Secondly, causal ML methods implement systematic model selection. ML methods search for the best functional forms by estimating and comparing a wide range of alternative model specifications; the model selection is thus data-driven and fully documented. For example, our results for the effect of corporate taxes, originally explored by Djankov et al. (2010), show that the data-driven model selection implemented by DML, which keeps a smaller set of influential confounding factors from among a large set of potential controls, leads to larger coefficients in absolute value and lower standard errors compared to OLS regressions where all the covariates are included.

With the traditional approach to model selection, uncertainty about the correct specification of the model can lead to choices that are relatively ad hoc; different specifications may lead to different point estimates, which in turn may lead to different policy decisions. Moreover, we further illustrate how these methods are also very useful tools for supplementary analyses or robustness checks. Typically, supplementary analysis is performed by presenting a number of selected regression specifications, while the approach of causal ML methods is more systematic, and ensures that important transformations of covariates that are not considered in the selected specifications are not overlooked.

Another observation is that, when researchers search for treatment effect heterogeneity, p-values for single hypothesis testing are not reliable. This is due to the multiple hypothesis testing problem, which can occur when researchers search iteratively for treatment effect heterogeneity over a large number of covariates.

While solutions have been proposed to correct for the issue of multiple hypothesis testing (for example, List et al., 2016), when the number of covariates is large, the power of these approaches to detect heterogeneity is low (Athey and Imbens, 2017).

The econometric theory literature on adapting standard machine learning techniques to causal inference questions is by now fast growing. See, for example, Chernozhukov et al. (2017), Chernozhukov et al. (2018a), Athey et al. (2018), Farrell et al. (2018) and Colangelo and Lee (2020) for the ATE; and Athey and Imbens (2016) and Wager and Athey (2018), among others, for the HTE.

A related issue is the ex-post selection of significant heterogeneous effects.
To avoid this problem, in randomized control trials researchers are often required to specify before the experiment which heterogeneous effects they are interested in looking into, in order to avoid searching for, and only reporting, significant effects. However, this limits the ability of the researcher to find unexpected relevant heterogeneity. Causal ML methods ensure that relevant heterogeneity is not missed while also providing valid confidence intervals. In addition, in observational studies, where pre-analysis plans are not common practice, causal ML methods can be particularly useful.

These methods have also generated a number of early applications. See, for example, Davis and Heller (2017b), Davis and Heller (2017a), Knaus et al. (2020), Strittmatter (2019) and Bertrand et al. (2017) for the causal random forest, and Deryugina et al. (2019) for the generic machine learning.

In what follows, we present our methodology and main findings on the ATE using double machine learning in Section 2. The methodology and analysis of HTE via the causal random forest is summarized in Section 3. Section 4 focuses on the methodology and analysis of HTE using the generic machine learning method. Finally, Section 5 concludes.
This section contains the analysis of the ATE for the effect of corporate taxes on investment and entrepreneurship (Djankov et al., 2010), the effect of plough agriculture on gender roles (Alesina et al., 2013), and the effect of skill-biased tariffs on growth (Nunn and Trefler, 2010), using the double machine learning method (Chernozhukov et al., 2017).
The method is suitable in settings with a large number of covariates relative to the sample size (either because the number of raw covariates is large to begin with, or there is a large number of technical controls), where typical non-parametric kernel or spline methods break down.

The main model specification of the method, in the notation of Chernozhukov et al. (2018a), is the partially linear regression:

Y = Dθ + g(X) + U    (1)
D = m(X) + V         (2)

where Y is the outcome, D is the treatment variable of interest, X is a (high-dimensional) vector of controls, and U and V are disturbances. Equation (1) is the main equation of interest and the parameter θ is the treatment effect we would like to estimate. In this model, θ quantifies the average treatment effect. The second equation is not of direct interest, but it keeps track of the dependence of the treatment on confounders. The covariates are related to the treatment through the function m(X) and to the outcome variable through the function g(X). While m(X) and g(X) can be nonlinear, the treatment variable, D, enters the model linearly (and additively). In observational studies, the function m is typically nonzero, which means that the treatment assignment is not random, but depends on the covariates. The partially linear regression model is also extended to a partially linear IV model to allow for endogenous treatment. We refer to this model as DML-IV.

A first idea one might have for estimating θ with ML methods would be to use a prediction-based ML approach and predict Y using D and X to obtain Dθ̂ + ĝ(X). This can be done, for example, by an iterative method that alternates between estimating g with some ML method and θ with OLS. While this 'naive' ML approach will have very good prediction performance, the iterative ML estimator will be heavily biased, with a slower than 1/√n convergence rate. The primary reason for this poor performance is the bias introduced by regularization.
In order to optimize prediction and avoid overfitting the data with complex functional forms, ML methods use regularization and shrink the less important coefficients towards zero. This reduces overfitting by decreasing the variance of the estimator, but at the same time introduces bias. The bias in estimating g transfers to the parameter of interest θ. The issue is similar to the omitted variable bias.

To overcome regularization bias, Chernozhukov et al. (2017) propose 'double machine learning', i.e., solving two prediction problems instead of one. First, a ML model is fitted for m in the treatment equation, and the effect of X is partialled out from D to get the residuals V̂ = D − m̂(X). Second, a ML method is fitted for g in the outcome equation and the residuals Ŵ = Y − ĝ(X) are obtained. Finally, the residuals Ŵ are regressed on the residuals V̂ to obtain the 'debiased' machine learning estimator, θ̌. It can be shown that by orthogonalizing D with respect to X and eliminating the effect of confounders by subtracting an estimate of g, θ̌ removes the effect of regularization bias.

However, θ̌ is still subject to bias due to overfitting. For instance, when ĝ is overfit, it will mistake noise for signal and thus pick up some of the noise U from the outcome equation. If U and V are correlated, the estimation error in ĝ will be correlated with V. To break this correlation and avoid bias due to overfitting, one can rely on sample splitting. To this end, the data is partitioned into two subsamples: a main sample and an auxiliary sample. The ML models for the two nuisance functions m and g are fit on the auxiliary sample, while the residual-on-residual regression to obtain θ̌ is fit on the main sample.

A drawback of sample splitting is that the estimator of the parameter of interest θ is obtained using only the main sample, which can lead to loss of efficiency.
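The split-sample, residual-on-residual procedure described above can be sketched in a few lines. This is a minimal illustration on simulated (hypothetical) data, using random forests from scikit-learn as the nuisance learners; it is not the implementation used in the paper.

```python
# Minimal sketch of split-sample DML on simulated data (all values hypothetical).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, theta = 1000, 20, 0.5          # sample size, covariates, true effect

# Simulate from the partially linear model: D = m(X) + V, Y = D*theta + g(X) + U
X = rng.normal(size=(n, p))
m = np.sin(X[:, 0]) + X[:, 1] ** 2   # nonlinear treatment equation
g = np.cos(X[:, 0]) + X[:, 2]        # nonlinear outcome equation
D = m + rng.normal(size=n)
Y = D * theta + g + rng.normal(size=n)

# Split: auxiliary sample for nuisance estimation, main sample for the
# residual-on-residual regression
aux, main = np.arange(n // 2), np.arange(n // 2, n)
m_hat = RandomForestRegressor(random_state=0).fit(X[aux], D[aux])
g_hat = RandomForestRegressor(random_state=0).fit(X[aux], Y[aux])

# Residual-on-residual regression on the main sample gives the debiased estimate
V_res = D[main] - m_hat.predict(X[main])
W_res = Y[main] - g_hat.predict(X[main])
theta_check = (V_res @ W_res) / (V_res @ V_res)
print(round(theta_check, 2))
```

Fitting the nuisance functions on the auxiliary half and running the final regression on the other half is exactly the device that breaks the correlation between the estimation error in ĝ and V discussed above.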
However, one can switch the roles of the main and auxiliary samples (a procedure called cross-fitting) and average the results, which will lead to a more efficient estimator. In addition, one can perform a K-fold version of the cross-fitting procedure, where the size of each fold is n/K. Each sample partition or fold is successively taken as the main sample, while the complement of each fold serves as the auxiliary sample. One can then take the average of the estimates over the K folds. To make the results robust to data partitioning, the splitting into folds is performed S times, and the final DML estimator is the mean (or median) over the splits. The median version is more robust to outliers and is the one we use in the applications.

The nuisance functions m and g can be estimated with a variety of ML methods, such as lasso, regression trees, random forests, boosting, neural networks, or hybrid methods.

This is because the scaled estimation error, √n(θ̌ − θ), now contains a term based on the product of two estimation errors (the estimation errors in m̂ and in ĝ), which vanishes faster than the equivalent term obtained from using the naive estimator, which depends only on the estimation error of ĝ.

2.2 Applications with Double Machine Learning

The first paper that we revisit using causal machine learning methods investigates the relationship between corporate taxes and investment and entrepreneurship (Djankov et al., 2010). This is an observational study that shows a negative effect of corporate taxes on investment and entrepreneurship, by estimating OLS country-level regressions with different measures of corporate tax rates for the year 2004. The sample includes a set of 50-85 countries, depending on the specification. In the original paper, four outcome variables are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate.
Three measures of corporate taxes are considered: statutory corporate tax rates, the actual first-year corporate income tax liability of a new company, and the tax rate which takes into account actual depreciation schedules going five years forward.

The original paper reports the results for several regression specifications with different sets of control variables, to account for potential confounders that correlate with corporate tax rates and are also determinants of the outcomes. Djankov et al. (2010) present regression results where the first three sets of covariates are added separately. A final robustness check includes all control variables (12 in total) in the same regression. In the specifications which include only one set of controls at a time, the paper shows a negative and statistically significant effect of corporate taxes on entrepreneurship and investment. However, when adding all the controls, the relationship is still negative, but the coefficients are smaller in size and no longer statistically significant.
DML Analysis.
We revisit the final robustness check of the paper, which includes all four sets of covariates at the same time, using the DML partially linear model. Table 1 presents the results. Columns (1) to (7)

The first set of controls includes measures of other taxes; the second set includes measures for the number of other tax payments made and for tax evasion; the third set includes measures for institutions; the fourth set includes measures of inflation. Section A.1 of the Appendix includes more details on the regressions estimated in Djankov et al. (2010) and describes the control variables.

Table 1:
                                (1)     (2)       (3)      (4)     (5)         (6)      (7)    (8)
                               Lasso  Reg. Tree  Boosting  Forest  Neural Net. Ensemble Best   OLS
Panel A: Investment 2003-2005
Statutory corporate tax rate  -0.074  -0.069  -0.068  -0.07   -0.056  -0.066  -0.071  -0.064
                              (0.09)  (0.072) (0.076) (0.087) (0.102) (0.087) (0.088) (0.098)
First-year effective tax rate -0.114  -0.129  -0.154  -0.144  -0.122  -0.13   -0.133  -0.117
                              (0.094) (0.087) (0.093) (0.096) (0.097) (0.092) (0.095) (0.106)
Five-year effective tax rate  -0.187  -0.182  -0.211  -0.21   -0.217  -0.216  -0.207  -0.189
                              (0.089) (0.089) (0.092) (0.097) (0.103) (0.095) (0.101) (0.118)
Observations                   61      61      61      61      61      61      61      61
Panel B: FDI 2003-2005
Statutory corporate tax rate  -0.148  -0.157  -0.153  -0.14   -0.085  -0.133  -0.114  -0.030
                              (0.083) (0.086) (0.092) (0.094) (0.093) (0.088) (0.09)  (0.066)
First-year effective tax rate -0.141  -0.194  -0.178  -0.157  -0.136  -0.161  -0.137  -0.1
                              (0.091) (0.081) (0.081) (0.074) (0.078) (0.08)  (0.079) (0.071)
Five-year effective tax rate  -0.147  -0.177  -0.167  -0.165  -0.139  -0.157  -0.14   -0.095
                              (0.084) (0.073) (0.074) (0.077) (0.082) (0.077) (0.076) (0.081)
Observations                   61      61      61      61      61      61      61      61
Panel C: Business density
Statutory corporate tax rate  -0.062  -0.092  -0.069  -0.07   -0.056  -0.066  -0.06   -0.034
                              (0.066) (0.072) (0.061) (0.063) (0.077) (0.069) (0.064) (0.083)
First-year effective tax rate -0.104  -0.156  -0.124  -0.122  -0.105  -0.114  -0.1    -0.068
                              (0.076) (0.082) (0.07)  (0.069) (0.085) (0.072) (0.07)  (0.092)
Five-year effective tax rate  -0.091  -0.139  -0.122  -0.107  -0.115  -0.114  -0.104  -0.070
                              (0.076) (0.08)  (0.071) (0.067) (0.087) (0.074) (0.075) (0.103)
Observations                   60      60      60      60      60      60      60      60
Panel D: Average entry rate 2000-2004
Statutory corporate tax rate  -0.112  -0.147  -0.141  -0.127  -0.067  -0.112  -0.106  -0.029
                              (0.073) (0.068) (0.064) (0.065) (0.084) (0.067) (0.069) (0.086)
First-year effective tax rate -0.130  -0.144  -0.143  -0.125  -0.131  -0.126  -0.117  -0.083
                              (0.072) (0.064) (0.065) (0.066) (0.086) (0.07)  (0.072) (0.094)
Five-year effective tax rate  -0.154  -0.153  -0.164  -0.164  -0.191  -0.168  -0.167  -0.133
                              (0.084) (0.069) (0.07)  (0.07)  (0.091) (0.08)  (0.077) (0.103)
Observations                   50      50      50      50      50      50      50      50
Raw covariates                 12      12      12      12      12      12      12      12
Notes:
Analysis of Table 5D of Djankov et al. (2010) using DML. Column 8 reports the original paper estimates. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. The number of covariates does not include the treatment variable.

display the DML point estimates for the effect of corporate taxes on investment and entrepreneurship, using different ML methods to estimate the nuisance functions. Further details on how the DML estimates are obtained, the methods used and the tuning parameters are described in Section A.1 of the Appendix.

We notice that all the DML point estimates have negative signs and generally similar magnitudes across the ML methods. Compared to the original paper results with the full set of covariates, reported in column (8), the magnitude of the DML coefficients is higher in absolute value, and the standard errors are lower in most regressions. Additionally, the results are statistically significant, at least at the 10% level, in half (42 out of 84) of the regressions. The main difference between our approach and that of Djankov et al. (2010) is that the original paper results are based on the assumption of linearity and additivity of the conditional expectations, while the DML method allows for a more flexible specification. Thus, our findings are more robust to potential nonlinear confounders compared to the original paper estimates. A researcher might be interested in investigating which nonlinear terms make the estimates different. However, this can be a challenging task when ML methods (such as neural networks, hybrid methods, etc.) are used to estimate the nuisance functions. What can potentially be done is analyzing the lasso coefficients that are not shrunk to zero and looking for nonlinearities among these.
As an example, we show in Figure B.1 the most relevant among the nonlinear terms selected by the lasso, for one of the DML regressions reported in Table 1. Here, we note that several nonlinear terms appear both in the treatment nuisance function m̂(·) and in the outcome nuisance function ĝ(·). This is suggestive of the fact that there are nonlinearities that are correlated with both the treatment variable and the outcome. These were missed by the analysis in the original paper, and their omission could lead to biased coefficients on the corporate taxes variables. In this case, controlling for all relevant confounders strengthens the main results of the original analysis: in many cases the DML treatment effect estimates are larger in absolute value, and statistically significant.

This empirical example is also useful to illustrate a typical trade-off that the applied researcher might face. On the one hand, the researcher wants to control for as many potential confounders as possible, in order to improve the credibility of the unconfoundedness assumption. On the other hand, naively controlling for a large set of covariates, especially when the sample size is small, can lead to imprecise estimates and larger standard errors. Notice that in this example, the authors implement a "kitchen sink" regression that includes all the potential controls at once.

Further details about the lasso coefficients analysis are reported in Section A.1 of the Appendix.
The study by Alesina et al. (2013) examines the relationship between historical plough agriculture and gender roles. The mechanism is the following: since the plough requires physical strength to be operated, in areas where plough agriculture was widespread, men had an advantage in agriculture compared to women. This would result in societies in which men worked in farming, whereas women's work would be performed mainly within the home. The division of labour by gender would translate into norms and cultural beliefs about the role of women in society, which would still persist nowadays, even after societies have moved out of agriculture as the main economic activity.

In the paper, the authors present results using country-level and individual-level regressions. We revisit the main question addressed in the original paper, focusing on the country-level results, as the majority of the regressions reported in the original paper are based on this data. For the country-level baseline regressions, estimated with OLS, three contemporary outcome variables are examined as measures of gender roles: female labour force participation, the share of firms with a woman among its principal owners, and the proportion of seats held by women in the national parliament. The treatment variable measures the share of individuals in each country whose ancestors practiced plough agriculture. The baseline regressions control for income, income squared, and for measures of the historical characteristics of the ethnicities living in a country. Continent fixed effects are added in some specifications.

As mentioned by Alesina et al. (2013), concerns about potential endogeneity are addressed in two ways.

More details on the regressions and the control variables from the study of Alesina et al. (2013) are described in Section A.2 of the Appendix.
Second, the authors use an instrumental variable approach, which exploits the fact that plough adoption is correlated with the suitability of the land for cereal crops that would benefit, and crops that would not benefit, from the plough. To this end, two instruments for plough adoption are constructed, based on the analysis by Pryor (1985). The first is the suitability for "plough-positive" cereal crops (i.e. those which benefit most from the plough), and the second is the suitability for "plough-negative" cereal crops (i.e. those which benefit least from the plough).

DML Analysis.
In our analysis, we re-examine both the country-level OLS and IV regressions, applying the DML method. For the OLS analysis, we begin by estimating a DML partially linear model that only includes the baseline set of controls as raw covariates. We then revisit the robustness analysis of this specification, by including as raw covariates the largest set of controls used in the robustness checks (this corresponds to Table 7, column 8 of the original paper), to which we also add the continent fixed effects. This amounts to a total of 36 raw covariates. For the IV analysis, as noted

The additional controls are listed in Section A.2 of the Appendix.

See Alesina et al. (2013) for details on the data used and how the instruments are constructed. The paper also shows that the two instruments, indeed, predict plough adoption.

When revisiting the robustness analysis with DML, we include continent fixed effects, even though the original paper did not include them in their most complete robustness checks. As causal ML methods can handle a large number of covariates, we include all the covariates which were considered in the original paper, to ensure that all potential confounders are taken into account.
in Alesina et al. (2013), the main concern with the instrumental variable strategy is the possibility that areas suitable for different crops could be correlated with geographic characteristics that have an effect on gender norms through other channels, besides plough adoption (i.e., the exclusion restriction might not hold). Therefore, for the IV analysis, in addition to the baseline controls and in line with the original paper, we consider the geo-climatic characteristics the authors use in their IV robustness checks (Table A14 of the Online Appendix of the original paper). To these variables, we again add the continent fixed effects. Further details on how the DML estimates are obtained and the tuning choices are described in Section A.2 of the Appendix.

Table B.2 reports the results of the DML partially linear model that replicates the baseline regression. In accordance with the original paper, the treatment effect estimates are negative and statistically significant. They are also close to the original estimates (reproduced here for convenience in column 8 of Table B.2), and, reassuringly, fairly stable across the ML methods. We find, however, very different results when carrying out the robustness analysis of this baseline specification with the DML method. Panel A of Table 2 reports the results. While the effect is still negative, albeit much smaller in absolute value, statistical significance is now lost. Interestingly, when Alesina et al. (2013) include all covariates at once (the estimate is reproduced in the last column of our Table 2), the treatment effect becomes smaller in absolute value, compared to when groups of covariates are added separately (see their Table 7, columns 1 to 7), or compared to the baseline specification (reproduced in column 8 of our Table B.2).
With DML, the treatment effect of interest not only becomes smaller, but also statistically insignificant.

Our findings up to this point would lead us to (mistakenly) conclude that the negative effect of plough adoption on attitudes towards gender roles may not be as large as suggested by the original analysis, and that the effect is not statistically significant. However, our estimates from the DML partially linear model may still be subject to endogeneity. While flexibly controlling for a large number of covariates can account for the confounding effect of observed characteristics, the remaining concern is that plough adoption may be correlated with unobserved characteristics that also affect gender norms.

Table 2: On the origins of gender roles: Country-level estimates with full set of controls. Columns (1)-(7): Lasso, Reg. Tree, Boosting, Forest, Neural Net., Ensemble, Best; column (8): OLS (Panel A) or 2SLS (Panel B). Panel A: DML, partially linear model; outcome: female labour force participation; treatment: traditional plough use. Panel B: DML-IV; outcome: female labour force participation; treatment: traditional plough use. Notes: Analysis of the main robustness checks of Alesina et al. (2013) using DML. Column 8 reports the results of the most complete robustness checks for the OLS and IV specifications in the original paper. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Robust standard errors are reported in column 8. The number of covariates does not include the treatment variable.

Overall, when looking at both the robustness analysis and the IV analysis and comparing them to the baseline results, we notice that our estimates move in the same direction as the original paper estimates, but our estimates move even more, supporting the idea that DML controls more flexibly for relevant covariates.

This empirical example is a good illustration of the gains from combining modern ML tools with quasi-experimental methods such as instrumental variables. While causal ML methods can make the unconfoundedness assumption more plausible by flexibly controlling for observed confounders, they cannot account for unobserved confounders. In such settings, the researcher could combine causal ML methods with quasi-experimental strategies.

As explained above, our DML specification differs from the original paper's robustness analysis because it considers nonlinearities and it includes continent fixed effects. Therefore, the differences between the DML and the original estimates could, in principle, be driven by the continent fixed effects, and not by the nonlinearities. The original paper shows that adding the continent fixed effects to the baseline specification leads to very small changes in the OLS estimates (see Table 4 in the original paper), while it results in larger changes in the IV case (see Table 8 in the original paper).
However, even in the IV case, including the continent fixed effects only increases the absolute size of the plough coefficient by 3-4 percentage points, while the DML coefficients exceed the OLS and 2SLS estimates by more than double that amount (with the exception of the neural network and ensemble estimates). Thus, we conclude that allowing for a more flexible nuisance function is likely to be driving at least part of the differences between the DML and the 2SLS (and OLS) estimates.
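The cross-fitted partially linear DML estimator discussed above can be illustrated with a toy simulation. This is a minimal sketch with made-up data and our own variable names, not the paper's replication code: the outcome and treatment are each residualized on the covariates using random forests trained on the opposite fold, and the treatment effect is the OLS coefficient of the outcome residuals on the treatment residuals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)                   # treatment, confounded by X[:, 0]
y = -0.5 * d + X[:, 0] ** 2 + rng.normal(size=n)   # true treatment effect: -0.5

# Cross-fitting: each fold's nuisances are predicted by models trained on the other fold
res_y, res_d = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    g = RandomForestRegressor(random_state=0).fit(X[train], y[train])   # E[Y | X]
    m = RandomForestRegressor(random_state=0).fit(X[train], d[train])   # E[D | X]
    res_y[test] = y[test] - g.predict(X[test])
    res_d[test] = d[test] - m.predict(X[test])

# Final stage: OLS of outcome residuals on treatment residuals
theta = res_y @ res_d / (res_d @ res_d)
```

Because the confounding through X[:, 0] enters the outcome nonlinearly, a linear regression on the raw covariates would be misspecified, while the forest-based residualization recovers an estimate close to the true effect of -0.5.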
The study by Nunn and Trefler (2010) investigates the relationship between skill-biased tariffs, i.e., a tariff structure that disproportionately favours skill-intensive industries, and long-term economic growth. The authors develop a theoretical framework based on Grossman and Helpman (1991) that shows how tariffs that focus on skill-intensive industries can lead to a disproportionate expansion of skill-intensive industries, which in turn leads to higher long-term growth. Furthermore, using both cross-country and industry-level data, the paper provides evidence of a positive relationship between the two variables and delves into the mechanisms of this relationship. The findings suggest that the mechanisms from the theoretical framework can explain only part of the total correlation between skill-biased tariffs and growth. The paper attributes the remaining part of the correlation to the endogeneity of skill-biased tariffs, and in particular to the relationship between institutions and the skill bias of tariffs: countries with good institutions tend to protect skill-intensive industries more.

In Nunn and Trefler (2010), three measures of the skill bias of tariffs in the initial time period are used: the correlation between industry tariffs and industry skill intensity, and two measures based on the difference between the log average tariffs in skill-intensive industries and the log average tariffs in unskilled-intensive industries, which use different cut-off values for industry skill intensity. In the country-level estimates, the outcome is log annual per capita GDP growth, and the regressions include a set of control variables. The country-level regressions include 63 observations.

For the industry-level estimates, the outcome variable is the average annual log change in industry output in each country, and the regressions include all the controls that appear in the country-level regressions, plus industry fixed effects.
These regressions include 1,004 data points for 59 countries. An additional variable (the initial industry tariff) is included in some specifications to capture a potential mechanism: skill-biased tariffs can shift resources towards skill-intensive industries that generate positive externalities, thus leading to higher long-term growth. Accordingly, industries with higher initial tariffs should have higher long-run output. If this channel can explain the effect of skill bias on growth, the coefficient on the skill bias of tariffs should decrease in size when this variable is included in the regression.
DML Analysis.
We revisit the country- and industry-level regressions reported in Table 4 (columns 1, 2 and 4), Table 5 (columns 1, 2 and 4) and Table 6 (columns 1, 3 and 7) of Nunn and Trefler (2010).

The initial time period is 1972 for 21 countries, 1980-83 for 30 countries and 1985-87 for 12 countries. The end period is 2000 for most countries, except for 3 of them, for which data ends in 1996. See Nunn and Trefler (2010), Table 1 for a list of the countries included and the respective time periods.

Further details on the regressions estimated by Nunn (2007) and on the control variables are described in Section A.3 of the Appendix.

[Table: The Structure of Tariffs and Long-Term Growth, country-level estimates. Columns: Lasso, Reg. Tree, Boosting, Forest, Neural Net., Ensemble, Best, OLS. Panel A: skill tariff correlation; Panel B: tariff differential (low cut-off); Panel C: tariff differential (high cut-off). Notes: Analysis of the country-level estimates of Nunn and Trefler (2010) using DML. The last column reports the original paper estimates. Standard errors are reported in parentheses; standard errors adjusted for variability across splits using the median method are reported for the DML estimates.]
The number of covariates does not include the treatment variable.

Heterogeneous Treatment Effects with Causal Random Forest
This section focuses on the analysis of HTE for the effect of Fox News on Republican voting (DellaVigna and Kaplan, 2007) using the causal random forest method (Wager and Athey, 2018).
The causal random forest method is an adaptation of the original random forest for prediction, introduced by Breiman (2001), to the problem of causal inference. In this section, we start by briefly presenting the general idea of standard regression trees used for prediction, after which we describe how causal trees and causal random forests work.

The idea of regression trees is to partition (or split) the data into groups based on the values of the covariates. The groups that are eventually obtained are referred to as leaves. First, one starts with the whole data set as one group. Then, for each value of each covariate, the regression tree algorithm forms candidate splits, by placing all observations with a covariate value lower than the current value in the left leaf, and all observations with a covariate value greater than the current value in the right leaf. Among all these candidate splits, the one that is implemented is the one that minimizes an in-sample criterion function, such as the mean squared error (MSE) of the outcome variable within a leaf. For each of the two new leaves, the algorithm repeats the procedure until a stopping rule is reached, resulting in a tree-format partition of the data. Using the terminal leaves, when the purpose is prediction, the outcome of an out-of-sample observation can be predicted by determining which terminal leaf the new observation belongs to, based on the values of its covariates, and assigning as its predicted outcome the mean of the outcomes in that leaf.

This mean squared error is computed as the sum of the squared differences between the outcomes of each unit within a leaf and the mean of these units in the leaf.

The stopping criterion can be, for example: a pre-specified maximum number of leaves; the iteration at which the minimizing split uses a covariate over which the observations have already been split; or the iteration at which the proposed split does not decrease the mean squared error any further.
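As an illustration with toy data (not from the paper), the prediction logic just described, greedy MSE-minimizing splits followed by leaf-mean prediction, can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 2))
# Step function of the first covariate plus noise: the ideal first split is at 0.5
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=400)

# Greedy splits minimizing within-leaf MSE; max_depth and min_samples_leaf act as stopping rules
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20).fit(X, y)

# A new observation is dropped down the tree and assigned its terminal leaf's mean outcome
pred = tree.predict(np.array([[0.2, 0.7]]))[0]
```

Since the new observation has a first covariate below 0.5, its leaf mean sits near 1, recovering the underlying step function.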
We now turn to the causal random forest method of Wager and Athey (2018), which builds on the causal tree method of Athey and Imbens (2016). For the causal tree, first, a fraction p of the sample of N observations is drawn without replacement. Then, the subsample of n = p * N observations is randomly split in half to form a training sample n_tr and an estimation sample n_e. Using only the training sample n_tr, candidate splits are formed for each value of each covariate, and a regression tree as described above is constructed. The key difference in the causal case compared to the prediction case is the objective function that is optimized when determining the split to be implemented.

Due to the fundamental problem of causal inference, directly training machine learning methods on the difference Y_i(1) - Y_i(0), i.e., the difference between the outcomes that observation i would have experienced with and without the treatment, is not possible, as we do not observe both outcomes for any individual unit. Thus, instead of minimizing an infeasible MSE, Athey and Imbens (2016) propose to maximize a criterion function that rewards a split that increases the variance of treatment effects across leaves and penalizes a split that increases within-leaf variance. The goal is to accurately estimate treatment effects within leaves, while preserving heterogeneity across leaves. The split is performed if it increases the criterion function, compared to no split. When no more splits can be made, the tree constructed on the first subsample is complete.

The subsequent step involves turning to the estimation sample n_e and, based on the covariates, sorting each observation in this sample into the same tree. Using only the estimation sample, the treatment effect in each leaf is computed as tau_hat_l = y_bar_lt - y_bar_lc, i.e., the mean outcome difference between treated (t) and control (c) observations within leaf l.
The final step consists of returning to the full sample of N observations, determining to which leaf each observation belongs based on the values of its covariates, and assigning that leaf's treatment effect as the predicted treatment effect of the observation. Given that estimates from a single tree can have high variance, the whole algorithm described above is repeated for B subsamples, on which B trees are obtained that together form a causal random forest. The predicted treatment effect for each unit is the average of the predictions for that particular observation across the trees.

Notice that independent samples are used for: i) growing the tree (splitting the data), and ii) estimating treatment effects within each leaf of the tree. This property is called honesty. Honesty leads to two desirable characteristics: it reduces bias from overfitting, and it makes inference valid, since the asymptotic properties of the treatment effect estimates are the same as if the structure of the tree had been exogenously given. Wager and Athey (2018) establish consistency and the first asymptotic normality results for random forests, which are then extended to the causal setting. For valid confidence intervals, a consistent estimator of the asymptotic variance is proposed, based on an infinitesimal jackknife for random forests. Further details regarding the tuning parameters of the causal random forest are provided in Section A.4 of the Appendix.
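A minimal honest causal tree in the spirit of the procedure above can be sketched on simulated data. This is a simplified stand-in, not the Wager-Athey implementation: we assume a randomized treatment with known propensity 0.5 and grow the tree on an IPW-transformed outcome in place of the Athey-Imbens splitting criterion, then estimate leaf effects on a held-out half (the honesty step).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(size=(n, 2))
w = rng.binomial(1, 0.5, size=n)               # randomized treatment, p(Z) = 0.5
tau = np.where(X[:, 0] < 0.5, 0.0, 2.0)        # true heterogeneous effect
y = tau * w + rng.normal(size=n)

# Honesty: one half grows the tree, the other half estimates leaf effects
idx = rng.permutation(n)
tr, est = idx[: n // 2], idx[n // 2 :]

# With p(Z) = 0.5 the transformed outcome 2*(2w - 1)*y has conditional mean tau(Z),
# so an off-the-shelf regression tree on it stands in for a causal split criterion
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50)
tree.fit(X[tr], 2 * (2 * w[tr] - 1) * y[tr])

# Estimation sample only: treated-minus-control mean outcome within each leaf
leaf_est = tree.apply(X[est])
leaf_all = tree.apply(X)
tau_hat = np.zeros(n)
for leaf in np.unique(leaf_all):
    m = leaf_est == leaf
    t, c = m & (w[est] == 1), m & (w[est] == 0)
    if t.any() and c.any():
        tau_hat[leaf_all == leaf] = y[est][t].mean() - y[est][c].mean()
```

Averaging such honest trees over many subsamples yields the causal random forest; here a single tree already separates the low-effect region (first covariate below 0.5) from the high-effect region.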
Description of Original Analysis.
In this section we revisit and further analyze the study by DellaVigna and Kaplan (2007). This paper examines the impact of media bias on voting outcomes. Specifically, it analyzes the impact of the entry of a conservative cable television channel, Fox News, on the Republican Party's vote share in the United States. To identify the causal effect of Fox News on voting, the authors investigate whether towns where Fox News became available between 1996 and 2000 experienced an increase in the vote share for the Republican Party in Presidential elections during the same time period. The estimation is performed on a town-level data set comprising information on 9,256 towns.

We consider the main outcome variable, i.e. the change in the vote share for the Republican party between 1996 and 2000. The treatment variable is a dummy indicating whether Fox News had become available in the town.

Sample splitting, in general, can be inefficient, as part of the data is not used. However, this loss of precision does not occur in the case of causal random forests: although no observation is allowed to be used within the same tree for both partitioning the covariate space and estimation, when the data is subsampled and the forest is obtained from many trees, each individual unit will appear in both the training sample and the estimation sample of some tree.

DellaVigna and Kaplan (2007) find a positive effect of Fox News on the Republican vote share. Moreover, they explore heterogeneity along a selected set of town characteristics: the number of available cable channels, the share of urban population, and whether the town is in a swing or Republican district. They do this by adding to the regression interaction effects of these covariates with the treatment variable.

Causal Forest Analysis.
We perform the HTE analysis using the causal random forest method. Exploring heterogeneous effects is important for this study in order to understand whether there are town or district characteristics that act as effect modifiers. While the average effects are informative about the impact of Fox News on the whole sample, it is often the case that treatment effects are not homogeneous. It is possible that the effect of Fox News was concentrated in some areas only. Understanding better the characteristics of the areas which saw the strongest and weakest responses can shed light on the mechanisms. The aim of this exercise is two-fold. First, we take an agnostic view about the nature of heterogeneity, and we investigate whether there are town or district characteristics which are treatment effect modifiers. Second, we examine whether the HTE analysis from the original paper matches the results from the causal ML methods.

We focus on one of the two preferred specifications from the original paper: the one that includes district fixed effects. We present results for two versions of the causal random forest, which account for district-level effects in different ways. In the first set of results, we include in the analysis dummy variables indicating the congressional district where the town is located. In the second set of results, we implement a cluster-robust version of the random forest developed by Athey and Wager (2019), where we treat each district as a separate cluster. The advantage of the cluster-robust causal forest is that it does not assume that clusters have an additive effect on the outcome. Further details on the cluster-robust causal forest and the tuning parameter values used for the analysis are discussed in Section A.4 of the Appendix.

Further details on the regressions and on the control variables in DellaVigna and Kaplan (2007) are described in Section A.4 of the Appendix.

The findings are reported in Table 6 of the original paper.

Table 4
                                        (1)                   (2)
                                 District dummies       Cluster-robust
Fox News effect (ATE)                 0.0065                0.0065
                                     (0.0016)              (0.0027)
Fox News effect above median          0.013                 0.0072
                                     (0.0024)              (0.0028)
Fox News effect below median         -0.0033                0.0044
                                     (0.0021)              (0.0048)
95% CI for the difference      (0.01009, 0.02255)    (-0.00806, 0.01374)
Observations                          9256                  9256

Notes: This table reports the estimated average treatment effect and a test for overall heterogeneity using the causal forest. Standard errors are reported in parentheses. ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively.

We begin by discussing the average treatment effect. The results are presented in Table 4. As in the original analysis, we find a positive and significant effect of Fox News on the Republican vote share, both when including district dummies and when implementing the cluster-robust causal forest; however, the standard error in the clustered forest is larger. Our results suggest that in towns where Fox News became available the Republican party obtained a vote share that was 0.65 percentage points higher on average, compared to towns where Fox News was not available. The ATE estimates are similar to the original paper estimates, which range between 0.004 and 0.007 (reported in Table 4 of DellaVigna and Kaplan, 2007, columns 4-7).

Next, we want to assess whether the causal forest can recover heterogeneity in treatment effects. As pointed out in Athey and Wager (2019), we can group observations according to whether their estimated out-of-bag conditional average treatment effect (CATE) is above or below the median CATE, and we can estimate the average treatment effect separately for these two subgroups. These are reported in Table 4 as
Fox News effect above median and Fox News effect below median. The difference between the two subgroup estimates is large when including district dummies, suggesting that there is potential for heterogeneity, and it is statistically significant, as indicated by the fact that the 95% confidence interval for the difference between the two estimates does not contain zero (see column 1 of Table 4). However, the same heuristic test for the cluster-robust forest does not detect significant heterogeneity in the treatment effect. This could indicate that heterogeneity in the model with district dummy variables is overstated, because the dummy variables cannot appropriately capture the district-specific effects. The cluster-robust causal forest offers a more flexible way to capture district-specific effects, and may be more suitable in this case.

Although the results of the test for overall heterogeneity are mixed, it is still possible for heterogeneity to be present along some of the covariates. Hence, we investigate whether any of the included covariates are possible sources of heterogeneity. To do this, for each variable, we split the sample in two parts, based on whether the value of the covariate of interest is below or above the median, and we estimate the average treatment effect for the two subsamples. Table 5 reports the HTE results along the variables that appear to be significant determinants of heterogeneity in both specifications, while Tables B.5 and B.6 report the results for the remaining variables. In addition, to gain further insight into which variables are more important for heterogeneity, we compute a measure of variable importance (Athey and Wager, 2019). Tables B.7 and B.8 report the variable importance measure for the covariates included in the district dummy variable specification and for the cluster-robust forest, respectively.
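The above/below-median heuristic can be sketched on simulated data. Here tau_hat is a noisy stand-in for the out-of-bag CATE predictions of a causal forest, and we assume a randomized treatment with known propensity 0.5 so that simple IPW scores identify subgroup average effects:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(size=n)
w = rng.binomial(1, 0.5, size=n)             # randomized treatment, p(Z) = 0.5
y = 2 * x * w + rng.normal(size=n)           # true CATE: tau(x) = 2x

tau_hat = 2 * x + 0.3 * rng.normal(size=n)   # stand-in for out-of-bag forest CATEs

# IPW scores: with p = 0.5, psi = 2*(2w - 1)*y has mean tau within any subgroup
psi = 2 * (2 * w - 1) * y
hi = tau_hat >= np.median(tau_hat)

ate_hi, ate_lo = psi[hi].mean(), psi[~hi].mean()
se = lambda a: a.std(ddof=1) / np.sqrt(a.size)
diff = ate_hi - ate_lo
diff_se = np.sqrt(se(psi[hi]) ** 2 + se(psi[~hi]) ** 2)
ci = (diff - 1.96 * diff_se, diff + 1.96 * diff_se)   # 95% CI for the difference
```

If the confidence interval for the difference excludes zero, as it does here by construction of the heterogeneous effect, the heuristic signals the presence of treatment effect heterogeneity.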
We note that for both specifications the variable importance measure decreases smoothly, and we do not observe any variable that clearly stands out in terms of importance. Our results in Table 5 show that three variables appear to be significant determinants of heterogeneity (at least at the 10% level) in both specifications: the change in employment between 1990 and 2000, the share of the population with education level equal to a high school degree, and the 10th decile in the number of cable channels available. We observe that the effect of Fox News on Republican voting is stronger in towns that experienced a smaller increase in the employment rate between 1990 and 2000. This finding may relate to the phenomenon of economic voting, i.e. the fact that voters tend to reward incumbents during periods of economic prosperity (e.g. Fair, 1978; Kramer, 1971; Lewis-Beck and Stegmaier, 2000; Pissarides, 1980).

Athey and Wager (2019) find a similar result in their application, when comparing the causal forest without clustering with the cluster-robust version.

See Section A.4 of the Appendix for details on how this measure is constructed.

Table 5
                                               (1)                 (2)                 (3)
                                         CATE below median   CATE above median   p-value difference
Panel A: District dummies
Employment rate, diff. btw. 2000 and 1990     0.00928             0.00064             0.00656
                                             (0.00244)           (0.00203)
Share high school degree 2000                 0.00805            -0.00007             0.00884
                                             (0.00226)           (0.00213)
Decile 10 in no. cable channels available     0.00877            -0.0044              0.00006
                                             (0.00192)           (0.00264)
Panel B: Cluster-robust
Employment rate, diff. btw. 2000 and 1990     0.00938             0.0002              0.06885
                                             (0.00254)           (0.00436)
Share high school degree 2000                 0.00859            -0.00179             0.05296
                                             (0.00303)           (0.00442)
Decile 10 in no. cable channels available     0.00857            -0.00495             0.02033
                                             (0.00289)           (0.00506)

Notes: This table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses.

Areas that experienced lower economic growth (and a smaller increase in employment) may have been more easily persuaded to vote Republican in 2000, since prior to the Presidential election of 2000 a Democratic President (Bill Clinton) had been in power for two consecutive mandates. Moreover, we observe a larger effect of Fox News in towns where the share of the population with education level equal to a high school degree is below the median. We also find a larger positive effect of Fox News in towns where the 10th decile in the number of cable channels is below the median, while the effect is negative and insignificant in towns where this variable is above the median.

The median value for the 10th decile in the number of cable channels is zero; hence, towns with a value of this variable above the median correspond to towns that are in the top decile in terms of the number of cable channels available.

Next, we investigate whether the findings regarding heterogeneity from the original paper are confirmed with the causal forest. DellaVigna and Kaplan (2007) found a larger effect of Fox News on the Republican vote share in towns with a smaller number of cable channels available when including district fixed effects. While we do not observe significant heterogeneity along this variable, our results for the 10th decile in the number
of cable channels are in line with the findings of the original analysis, and hence suggest that the effect of Fox News diminishes in the presence of stronger competition among cable channels. It is also interesting to note that the number of cable channels emerges as the variable with the highest importance score in both specifications, which further points to the importance of this variable for heterogeneity. When investigating heterogeneity along the political orientation of the district, we confirm the findings of DellaVigna and Kaplan (2007): we observe no significantly different effect for swing districts, and we obtain mixed results for Republican districts, as we find a significantly smaller effect of Fox News in Republican districts (at the 10% level) when including district dummies, but not with the cluster-robust forest. However, in contrast to the original analysis, we do not find a significant difference in the effect of Fox News in rural versus urban towns, despite this being the only heterogeneity result that is robust in all specifications in DellaVigna and Kaplan (2007).

In conclusion, our analysis of the HTE of Fox News on Republican voting confirms some of the findings of DellaVigna and Kaplan (2007), namely the presence of heterogeneity along the number of cable channels and no robust heterogeneous effects for districts with different political orientations, but as opposed to the original paper it does not show different effects for urban and rural areas. The analysis with the causal forest further uncovers additional heterogeneity that was previously unexplored, such as a larger effect in towns that experienced a smaller increase in the employment rate, and a larger effect in towns with a lower share of the population with a high school degree.
Finally, including district dummy variables results in the causal forest detecting more heterogeneity in treatment effects compared to the cluster-robust version, both when implementing the overall heterogeneity test and when analysing the HTE in terms of individual covariates. However, the model with district dummy variables could overstate the heterogeneity compared to the cluster-robust forest if the district dummies do not appropriately capture the district-specific effects. This points to the need for a more careful treatment of the issue of clustered observations when employing causal random forests in empirical applications.

DellaVigna and Kaplan (2007) found mixed results for Republican districts in different specifications.
This section focuses on the analysis of HTE for the effect of a teacher training intervention (Loyalka et al., 2019) using the generic machine learning method (Chernozhukov et al., 2018b).
A different causal ML approach to HTE is the generic machine learning method of Chernozhukov et al. (2018b). To make inference possible, the method does not focus directly on the HTEs, but on features of the HTEs, such as: the best linear predictor of the heterogeneous effects (BLP), the group average treatment effects (GATES) sorted by groups induced by machine learning proxies, and the average characteristics of the units in the most and least affected groups, or classification analysis (CLAN). The generic machine learning method is thus useful for empirical work as it: (1) allows detection of heterogeneity in the treatment effect, (2) computes the treatment effect for different groups of observations (such as the least affected or most affected groups), and (3) describes which covariates are most strongly correlated with the heterogeneity.

The approach is based on randomly splitting the data into an auxiliary and a main sample, approximately equal in size. Based on the auxiliary sample, an ML estimator, called the proxy predictor, is constructed for the conditional average treatment effect (CATE). Any generic ML method can be used for this approximation (e.g., elastic net, random forest, neural network, etc.). The proxy predictors are possibly biased, and consistency is not required. We simply take them as approximations and use them to estimate and make inference on features of the CATE. Based on the main sample and the proxy predictors, we can compute the estimates of interest, BLP, GATES and CLAN, and then make inference relying on many splits of the data into auxiliary and main samples.

We give a brief description of how the method works in practice. Let Y be the outcome of interest, D the binary treatment variable, and Z a vector of covariates. Define b(Z) = E[Y(0) | Z], the baseline conditional average, and s(Z) = E[Y(1) | Z] - E[Y(0) | Z], the conditional average treatment effect (CATE). Using the auxiliary sample we obtain ML estimators (or proxy predictors) for the baseline conditional average and the conditional average treatment effect. As mentioned above, these are possibly biased predictors and consistency is not required. Then, for each unit in the main sample, we compute the predicted baseline effects, B(Z), and the predicted treatment effects, S(Z). Note that the predicted treatment effects, S(Z), are obtained as the difference between the predictions of the treatment group model and the control group model. Following the notation of Chernozhukov et al. (2018b), the BLP parameters are obtained using the main sample, by estimating the following regression by weighted OLS, with weights 1/(p(Z)(1 - p(Z))):

Y = a'X_1 + b_1 (D - p(Z)) + b_2 (D - p(Z))(S(Z) - S_bar) + e,   (3)

where X_1 = [1, B(Z)], p(Z) = P[D = 1 | Z] is the propensity score, and S_bar is the average of the predicted treatment effect estimates over the main sample. The control B(Z) is included to improve efficiency. Note that the component (D - p(Z)) is part of the regressor (D - p(Z))(S(Z) - S_bar); it orthogonalizes this regressor with respect to all other covariates that are functions of Z. The coefficient b_1 gives the average treatment effect, while b_2 quantifies how well the proxy predictor approximates the treatment heterogeneity. If b_2 is different from zero, there is heterogeneity in the treatment effects.

Once we obtain the predicted treatment effects, we can divide the observations in the main sample into groups G_1, G_2, ..., G_K, based on their treatment effects.
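The weighted BLP regression (3) can be sketched with simulated data; B and S below are stand-ins for the proxy predictors that would come from the auxiliary sample, and the propensity score is a known constant because the design is randomized.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3000
z = rng.uniform(size=n)
p = np.full(n, 0.5)                              # known propensity score (randomized design)
d = rng.binomial(1, p).astype(float)
y = z + (1 + 2 * z) * d + rng.normal(size=n)     # true CATE: s(z) = 1 + 2z, so ATE = 2

B = z                                            # stand-in proxy for the baseline b(Z)
S = 1 + 2 * z + 0.3 * rng.normal(size=n)         # noisy proxy predictor for s(Z)

# Weighted OLS of (3): regressors [1, B, D - p, (D - p)(S - mean(S))],
# weights 1/(p(1 - p)); b_1 is the ATE, b_2 the heterogeneity loading
X1 = np.column_stack([np.ones(n), B, d - p, (d - p) * (S - S.mean())])
W = 1.0 / (p * (1 - p))
coef = np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (W * y))
b1, b2 = coef[2], coef[3]
```

With a noisy but informative proxy, b_1 recovers the average treatment effect of 2, and b_2 is positive and bounded away from zero, signaling genuine heterogeneity picked up by the proxy.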
In our empirical applications, we choose K = 5, such that group G_1 contains the observations with the lowest 20% of treatment effects and G_5 contains the observations with the highest 20%. Then, using again the main sample, we obtain the sorted group average treatment effects by estimating the weighted regression:

Y = a'X_1 + sum_{k=1}^{K} g_k (D - p(Z)) * 1(G_k) + v,   (4)

where 1(G_k) is an indicator for whether an observation is in group k, and where the weights are the same as in (3). The parameters g_k give the average effect in each group (GATES). Also, if the difference g_K - g_1 is significantly different from zero, we again have evidence of treatment effect heterogeneity between the most and least affected groups.

Lastly, we can analyze the properties or characteristics of the most and least affected groups via classification analysis (CLAN). Let g(Y, Z) be a vector of characteristics of an observation. We can compute the average characteristics of the least and most affected groups, i.e., delta_1 = E[g(Y, Z) | G_1] and delta_K = E[g(Y, Z) | G_K], the parameters of interest being averages of directly observed variables. Similarly to GATES, we can compute and make inference on the difference delta_K - delta_1.

Description of Original Analysis.
We reanalyze a large-scale randomized experiment that investigates the effect of a teacher professional development (PD) program in China on student achievement and on other student and teacher outcomes. The experiment was first studied by Loyalka et al. (2019). Three hundred mathematics teachers, each employed in a different school across one province, took part in the intervention. The teachers were randomly assigned to one of several treatment arms: PD only; PD plus continuous follow-up with additional material and tasks for the trainees; PD plus an evaluation of the extent to which the teachers remembered the content of the training sessions; or no PD (control group). The PD intervention consisted of lectures and discussions.

Randomization was implemented at the school level, and in each school one teacher was nominated to participate in the intervention. The main results are obtained by estimating a cross-sectional regression, where the treatment variable is a dummy indicating the treatment arm that the school was assigned to. The data was collected at three points in time: at baseline, midline and endline. Outcomes are measured at midline or endline, and the main outcome of interest is student math achievement. The control variables include student characteristics, teacher characteristics and class size.

The original paper finds no significant effect of the PD intervention on students' achievement after one academic year, neither for the PD intervention alone, nor for the PD combined with the follow-up and/or the evaluation treatments. The authors also do not find any effect on other outcomes, such as teacher knowledge or student motivation. The lack of effectiveness of the program is attributed to several factors: the content was too theoretical, the PD was delivered passively, and teachers could face constraints in implementing the suggested practices in their schools.
Furthermore, the paper analyzes heterogeneous treatment effects by interacting the treatment variable with a number of student and teacher characteristics: the student's household wealth, baseline achievement level, the amount of training the teacher received prior to the intervention, student and teacher gender, whether the teacher has a college degree, and whether the teacher majored in math. The findings suggest that the effect of the treatment on students' achievement can differ by teacher characteristics; however, no heterogeneous effects are found in terms of student characteristics.
Generic ML Analysis.
We extend the analysis of HTE conducted in the original paper by implementing the generic machine learning method developed by Chernozhukov et al. (2018b). Exploring heterogeneous treatment effects is particularly relevant for this intervention, because a small and insignificant estimate for the ATE could hide significant heterogeneity. Our aim is to dig deeper into the analysis of heterogeneous treatment effects. First, we investigate whether there is significant heterogeneity in treatment effects; second, we analyze whether causal machine learning methods, by implementing a systematic search for heterogeneity across a large number of covariates, can offer additional insights about the characteristics of those who benefited from the program and those who did not, compared to the traditional methods used in the original paper.

As Loyalka et al. (2019) show similar results when estimating the impact of the intervention at midline or endline, we focus on the outcome variables measured at endline. Section A.5 of the Appendix describes the regressions and the control variables.

In our analysis, we focus on the main outcome of interest, i.e. student math achievement. Since the results in the original paper are consistently close to zero when comparing the three different treatment arms with the control group, we choose to only analyze one of the treatment arms, corresponding to the PD intervention plus the evaluation. The sample that we use includes 10,006 students in 201 schools. We follow Loyalka et al. (2019) and cluster standard errors at the school level. In addition to the full set of controls included in the original paper, we also add to our analysis other variables that could be treatment effect modifiers: the baseline values of a number of student-level variables, plus variables indicating teachers' behaviour in the classroom, evaluated by students at baseline.

These additional variables are described in Section A.5 of the Appendix. In Loyalka et al. (2019), the baseline value of the outcome variable is included as a control. Hence, the baseline characteristics described above are not included in all regressions in the original analysis. However, we consider these characteristics as potential drivers of heterogeneity; therefore, we include the baseline values of all available variables in our heterogeneity analysis.

The generic method can be used in conjunction with a range of ML tools, and Chernozhukov et al. (2018b) provide two measures (Best BLP and Best GATES) to compare the performance of the different ML methods used for the estimation of the proxy predictors. We consider the following methods: elastic net, neural network, and random forest. Based on the results of the Best BLP and Best GATES analysis, reported in Table B.9 of the Appendix, we choose to further work with the neural network.

Table 6: Teacher Training - Generic Method: BLP of the CATE

                             (1)              (2)
                             ATE (β1)         HET (β2)
Estimate                     0.002            0.651
90% Confidence Interval      (-0.068, 0.072)  (0.312, 0.990)
p-value                      1.000            0.0003
Observations                 10006            10006

Notes: The estimates are obtained using a neural network to produce the proxy predictor S(Z). The values reported correspond to the medians over 100 splits.
Further details on the Best BLP and GATES measures and on the tuning parameters used in this analysis are discussed in Section A.5.
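To make the mechanics of the method concrete, the sketch below illustrates one split of the generic ML procedure on simulated data: fit proxy predictors on an auxiliary half, then run the BLP regression of the outcome on the demeaned treatment and its interaction with the demeaned proxy on the main half. All variable names, the simulated design, and the choice of random forests as the ML tool are our own illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code) of one split of the generic ML
# BLP step of Chernozhukov et al. (2018b), on simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, k = 2000, 5
Z = rng.normal(size=(n, k))                       # covariates
D = rng.binomial(1, 0.5, size=n)                  # randomized treatment, p = 0.5
cate = Z[:, 0]                                    # true heterogeneous effect
Y = Z[:, 1] + D * cate + rng.normal(size=n)

# 1) Split the sample into an auxiliary half (A) and a main half (M).
aux = rng.permutation(n)[: n // 2]
main = np.setdiff1d(np.arange(n), aux)

# 2) On A, fit proxies: B(Z) for the baseline E[Y | D=0, Z], and S(Z) for
#    the CATE as the difference between the two fitted outcome surfaces.
f0 = RandomForestRegressor(random_state=0).fit(Z[aux][D[aux] == 0], Y[aux][D[aux] == 0])
f1 = RandomForestRegressor(random_state=0).fit(Z[aux][D[aux] == 1], Y[aux][D[aux] == 1])
B = f0.predict(Z[main])
S = f1.predict(Z[main]) - f0.predict(Z[main])

# 3) On M, run OLS of Y on B, S, (D - p) and (D - p)(S - mean(S)); the last
#    two coefficients are beta_1 (ATE) and beta_2 (heterogeneity loading).
#    With p constant at 0.5, the weights 1/(p(1-p)) are a constant and do
#    not change the OLS solution, so they are omitted here.
p = 0.5
X = np.column_stack([np.ones(main.size), B, S,
                     D[main] - p, (D[main] - p) * (S - S.mean())])
beta = np.linalg.lstsq(X, Y[main], rcond=None)[0]
ate_hat, het_hat = beta[3], beta[4]
```

In the paper's implementation this step is repeated over 100 splits, with medians of the estimates and split-adjusted confidence intervals; the sketch shows a single split only.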
[Figure 1: Teacher Training - Generic Method: GATES. Notes: The estimates are obtained using a neural network to produce the proxy predictor S(Z). The point estimates and 90% confidence intervals correspond to the medians over 100 splits.]
We first analyze whether overall heterogeneity in treatment effects can be detected. We present results for the best linear predictor (BLP) of the CATE in Table 6. In line with the original paper, the estimated ATE, given by the coefficient β1, is small (the estimated impact of the PD is 0.002 standard deviations) and not significantly different from zero. The estimated β2 is instead large and significantly different from zero, which indicates that there is heterogeneity in treatment effects. Next, we estimate the group average treatment effects (GATES). We split the sample into five groups, based on the quintiles of the ML proxy predictor S(Z). This analysis reveals further insights into the extent of heterogeneity. Table B.10 of the Appendix reports the GATE in the top and bottom quintiles and shows that the GATE in the top quintile is positive, whereas for the bottom quintile the estimated GATE is negative. Both estimates are statistically significant at the 10% level. The difference between the GATE for the top and the bottom quintile is significant, which confirms the presence of heterogeneity in treatment effects.
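The GATES step just described can be sketched in the same spirit: group observations by quintiles of the proxy S(Z) and estimate one treatment effect per group. The simulated data and names below are illustrative assumptions, not the paper's variables.

```python
# Illustrative GATES sketch: group-specific treatment effects by quintile
# of the proxy predictor S(Z), on simulated data (not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 0.5
S = rng.normal(size=n)                     # ML proxy predictor S(Z)
D = rng.binomial(1, p, size=n)             # randomized treatment
Y = D * S + rng.normal(size=n)             # true effect increases with S

# Quintile groups G1..G5 based on S(Z).
edges = np.quantile(S, [0.2, 0.4, 0.6, 0.8])
group = np.searchsorted(edges, S)          # group index 0..4
G = np.eye(5)[group]                       # group dummies

# OLS of Y on controls plus the group dummies interacted with (D - p):
# the five interaction coefficients are the GATES gamma_1..gamma_5.
X = np.column_stack([np.ones(n), S, G[:, 1:], G * (D - p)[:, None]])
coef = np.linalg.lstsq(X, Y, rcond=None)[0]
gates = coef[-5:]                          # one effect per quintile
```

Here the estimated effects rise across quintiles by construction of the simulated data; in the application, a significant gap between the top and the bottom quintile is the evidence of heterogeneity.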
Additionally, Figure 1 reports the GATES estimates and the 90% confidence intervals for the five quintiles, as well as for the whole sample (the ATE is represented as a blue dashed line, and the confidence interval as two red dashed lines). Notice that for the three middle quintiles the effect of the teacher training intervention is not significantly different from zero.

We then turn to analyzing the possible sources of heterogeneity by implementing the Classification Analysis (CLAN). Thus, we analyze further the top and bottom quintiles in terms of ATE, for which the effect of the PD intervention is positive and negative respectively. In particular, we compare the student and teacher characteristics in the two groups. As a large number of covariates is available, we focus on the ten covariates for which the correlation with the proxy predictor, S(Z), is highest, reported in Table 7. Table B.11 in the Appendix shows the CLAN analysis for the remaining covariates.

Table B.12 reports the correlation of each of the covariates with S(Z).

Table 7: Teacher Training - Generic Method: Classification Analysis

                                  (1)                (2)                 (3)
                                  20% most affected  20% least affected  p-value for the difference
Teacher college degree            0.039              0.800               0.000
                                  (0.019, 0.059)     (0.780, 0.820)
Teacher training hours            2.447              1.684               0.000
                                  (2.399, 2.494)     (1.636, 1.731)
Teacher ranking                   0.666              0.405               0.000
                                  (0.635, 0.697)     (0.374, 0.437)
Student age                       14.18              13.73               0.000
                                  (14.11, 14.25)     (13.65, 13.80)
Teacher experience (years)        16.18              13.16               0.000
                                  (15.60, 16.76)     (12.58, 13.74)
Student female                    0.417              0.555               0.000
                                  (0.385, 0.449)     (0.523, 0.587)
Teacher age                       37.51              35.01               0.000
                                  (37.02, 38.00)     (34.52, 35.50)
Student math score at baseline    -0.029             0.169               0.005
                                  (-0.088, 0.031)    (0.110, 0.229)
Student baseline math anxiety     0.298              -0.219              0.000
                                  (0.236, 0.360)     (-0.281, -0.157)
Class size                        52.87              64.37               0.000
                                  (51.82, 53.93)     (63.32, 65.43)

Notes: This table shows the average value of the teacher and student characteristics for the most and least affected groups. The estimates are obtained using a neural network to produce the proxy predictor S(Z). 90% confidence intervals are reported in parentheses. The variables Student math score at baseline and Student baseline math anxiety are normalized. The values reported correspond to the medians over 100 splits.

We start by analyzing the characteristics of the teachers whose students belong to the least and most affected groups. Interestingly, for the variable indicating whether the teacher majored in math, the direction of the effect is consistent with what was found in the original analysis: the students in the top quintile are more likely to have been taught by a teacher who does not have a major in math, compared to the students in the bottom quintile. It is also interesting to note that the number of hours of training that the teacher received prior to the intervention, which is not found to be a determinant of heterogeneity in the original paper, is higher in the most affected group compared to the least affected group. This may reflect the fact that teachers who have had more training in the past may be able to better implement the suggestions from the PD intervention. Table 7 shows that teacher rank, experience and age are higher in the most affected group compared to the least affected group.
This is consistent with the existence of a similar mechanism: teachers who have more experience may be better able to put the suggested practices into use.

When considering the PD plus follow-up, the authors find a significant negative effect on the scores of students whose teachers majored in math relative to the scores of those whose teachers did not.

The variable indicating teacher training hours prior to the intervention is a categorical variable, based on the terciles of the continuous variable. As the continuous variable is not included in the replication data set of the original paper, for our analysis we use this categorical variable, which takes values 1 to 3, where 3 is the top tercile in the number of training hours.

We then examine whether student characteristics are potential drivers of heterogeneity. In contrast to the findings in Loyalka et al. (2019), who did not find heterogeneity in terms of student features, we find that students in the most affected group differ in terms of several characteristics compared to students in the least affected group. Among the most correlated with the heterogeneity score (listed in Table 7) are student age and gender: students in the most affected group are on average about half a year older than students in the least affected group, and the most affected group includes a larger share of male students. Additionally, students in the most affected group, on average, have a lower baseline math score, and tend to be more anxious about math. Thus, teacher PD could be more beneficial for weaker students, and for students who are more anxious about the subject. Finally, class size appears to be a possible determinant of heterogeneity: students who benefit more from the PD tend to be in smaller classes. This result suggests that in smaller classes it may be easier for teachers to implement some of the practices introduced during the PD training. For instance, Loyalka et al. (2019) mention having students work together in small groups as one of the techniques that were suggested in the PD; this technique is likely to be easier to implement in smaller classes.

In conclusion, our analysis confirms the presence of heterogeneous effects of the teacher PD intervention, and uncovers a rich set of potential determinants of heterogeneity. With the GATES analysis, we are able to show that the achievement of students belonging to the bottom quintile is negatively affected by the intervention, while the achievement of students in the top quintile is positively affected by the intervention. This confirms what was suggested by Loyalka et al. (2019): that there is a group of students who benefit from the intervention, and a group who do not. In addition, the GATES analysis shows that the effect is not significantly different from zero for the students belonging to the middle quintiles. With the CLAN analysis, we can obtain a clearer picture of the characteristics of the groups who do and do not benefit from the intervention, compared to the original HTE analysis. In line with Loyalka et al. (2019), we find that teacher characteristics such as having a college degree or having a major in math are potential determinants of heterogeneity. However, our study uncovers additional differences (that were not identified in the original paper) between the least and the most affected groups, in terms of both teacher and student characteristics, such as teacher's rank, experience, age and number of training hours, as well as student's gender, age, baseline math score, baseline math anxiety and class size.
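The CLAN comparison used above, i.e. averaging covariates within the most and least affected 20% of observations and testing the difference, can be sketched as follows. The data and variable names are simulated illustrations, and the sketch omits the sample splitting and median-over-splits aggregation used in the actual procedure.

```python
# Illustrative CLAN sketch: compare a covariate's mean in the 20% most
# and least affected groups (by the proxy S(Z)). Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5000
age = rng.normal(14.0, 1.0, size=n)            # e.g. student age
S = 0.3 * (age - 14.0) + rng.normal(size=n)    # proxy correlated with age

lo, hi = np.quantile(S, [0.2, 0.8])
least, most = S <= lo, S >= hi                 # bottom / top 20% by S(Z)

mean_most, mean_least = age[most].mean(), age[least].mean()
t_stat, p_val = stats.ttest_ind(age[most], age[least], equal_var=False)
```

A rejected equality of means, as in Table 7, flags the covariate as a potential driver of the detected heterogeneity.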
Conclusion

Our main message is that appropriately combining predictive methods with causal questions adds value to traditional methods and should be more often explored in applied research. We argue that in each revisited study the researcher would have benefited from employing causal ML methods and would have gained additional insights not provided by standard causal inference tools.

When the researcher works with an observational study and is interested in the ATE, causal machine learning methods can improve the credibility of causal analysis by making the unconfoundedness assumption more plausible, as causal ML methods control for potential confounders in a more flexible way; implement a systematic model selection; and are robust approaches for sensitivity analysis. If the researcher is interested in HTE, causal machine learning methods can ensure that relevant heterogeneity and its determinants are not missed, or falsely discovered due to multiple hypothesis testing issues. Also, causal ML methods can be used to uncover heterogeneity ex-post, without being bound to explore HTE only for the specific subgroups indicated in the pre-analysis plan.

Note that even if the empirical study is a randomized control trial and controlling for confounding factors is not necessarily needed, the use of causal machine learning methods can improve efficiency and provide more precise estimates with lower standard errors and tighter confidence intervals.

References

Alberto Alesina, Paola Giuliano, and Nathan Nunn. On the origins of gender roles: Women and the plough.
The Quarterly Journal of Economics, 128(2):469–530, 2013.

S. Athey and G. W. Imbens. Machine learning methods economists should know about. arXiv preprint, 2019.

Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

Susan Athey and Guido W Imbens. The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32, 2017.

Susan Athey and Stefan Wager. Estimating treatment effects with causal forests: An application. arXiv preprint arXiv:1902.07409, 2019.

Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.

Marianne Bertrand, Bruno Crépon, Alicia Marguerie, and Patrick Premand. Contemporaneous and post-program impacts of a public works program: Evidence from Côte d'Ivoire. Working Paper, 2017.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, 2017.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a.

Victor Chernozhukov, Mert Demirer, Esther Duflo, and Ivan Fernandez-Val. Generic machine learning inference on heterogenous treatment effects in randomized experiments. Working Paper, National Bureau of Economic Research, 2018b.

Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with continuous treatments. arXiv preprint arXiv:2004.03036, 2020.

Jonathan Davis and Sara B Heller. Using causal forests to predict treatment heterogeneity: An application to summer jobs. American Economic Review, 107(5):546–50, 2017a.

Jonathan MV Davis and Sara B Heller. Rethinking the benefits of youth employment programs: The heterogeneous effects of summer jobs. Review of Economics and Statistics, pages 1–47, 2017b.

Stefano DellaVigna and Ethan Kaplan. The Fox News effect: Media bias and voting. The Quarterly Journal of Economics, 122(3):1187–1234, 2007.

Tatyana Deryugina, Garth Heutel, Nolan H Miller, David Molitor, and Julian Reif. The mortality and medical costs of air pollution: Evidence from changes in wind direction. American Economic Review, 109(12):4178–4219, 2019.

Simeon Djankov, Tim Ganser, Caralee McLiesh, Rita Ramalho, and Andrei Shleifer. The effect of corporate taxes on investment and entrepreneurship. American Economic Journal: Macroeconomics, 2(3):31–64, 2010.

Ray C Fair. The effect of economic events on votes for president. The Review of Economics and Statistics, pages 159–173, 1978.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. arXiv preprint arXiv:1809.09953, 2018.

Gene M Grossman and Elhanan Helpman. Innovation and Growth in the Global Economy. MIT Press, 1991.

Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.

Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Guido W Imbens and Jeffrey M Wooldridge. Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86, 2009.

Michael C Knaus, Michael Lechner, and Anthony Strittmatter. Heterogeneous employment effects of job search programmes: A machine learning approach. Journal of Human Resources, pages 0718–9615R1, 2020.

Gerald H Kramer. Short-term fluctuations in US voting behavior, 1896–1964. American Political Science Review, 65(1):131–143, 1971.

Michael S Lewis-Beck and Mary Stegmaier. Economic determinants of electoral outcomes. Annual Review of Political Science, 3(1):183–219, 2000.

John A List, Azeem M Shaikh, and Yang Xu. Multiple hypothesis testing in experimental economics. Experimental Economics, pages 1–21, 2016.

Prashant Loyalka, Anna Popova, Guirong Li, and Zhaolei Shi. Does teacher training actually work? Evidence from a large-scale randomized evaluation of a national teacher training program. American Economic Journal: Applied Economics, 11(3):128–54, 2019.

Nathan Nunn. Relationship-specificity, incomplete contracts, and the pattern of trade. The Quarterly Journal of Economics, 122(2):569–600, 2007.

Nathan Nunn and Daniel Trefler. The structure of tariffs and long-term growth. American Economic Journal: Macroeconomics, 2(4):158–94, 2010.

Christopher A Pissarides. British government popularity and economic performance. The Economic Journal, 90(359):569–581, 1980.

Frederic L Pryor. The invention of the plow. Comparative Studies in Society and History, 27(4):727–743, 1985.

Vira Semenova, Matt Goldman, Victor Chernozhukov, and Matt Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2018.

Anthony Strittmatter. What is the value added by using causal machine learning methods in a welfare experiment evaluation? Working Paper, 2019.

Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb):141–158, 2009.

Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

Achim Zeileis, Torsten Hothorn, and Kurt Hornik. Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2):492–514, 2008.
Details on Revisited Studies and Implementation of Causal ML Methods
A.1 The Effect of Corporate Taxes on Investment and Entrepreneurship
Details on the Original Analysis.
In Djankov et al. (2010), the baseline regression equation is the following:

    y_c = α + β taxes_c + X_c Γ + ε_c,

where c is an index for country. Four different outcome variables are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate (measured as a percentage). Three separate measures of corporate taxes are considered. The first is the statutory corporate tax rate, which is the marginal tax rate on income a corporation has to pay assuming the highest tax bracket. The second is the actual first-year corporate income tax liability of a new company, relative to pre-tax earnings. The third is the tax rate which takes into account actual depreciation schedules going five years forward.

The term X_c denotes the control variables, aimed at capturing the effect of potential confounding factors. This is an observational study, in which tax rates are not randomly assigned across countries. It is likely that there will be factors which are correlated with both the treatment (corporate tax rates) and with the outcomes (measures of entrepreneurship and investment). To deal with this issue, the effect of corporate taxes on the outcomes is estimated by adding several control variables to the regressions. The first set of control variables are measures of other taxes: the sum of other taxes payable in the first year of operation, VAT tax, sales tax, and the highest national rate on personal income tax. The second set of covariates includes the logarithm of the number of tax payments made (which is used as a measure of the burden of tax administration), an index of tax evasion, and the number of procedures to start a business. The third set of controls are institutional variables: a property rights index, an indicator of the rigidity of employment laws, a measure of a country's openness to trade, and the log of per capita GDP.
The fourth set of covariates are measures of inflation: average inflation in the previous ten years, and seigniorage, which captures government reliance on printing money.

Details on the DML Analysis.
The results are based on 100 splits and 2 folds. The point estimates are calculated as the median across splits, and the standard errors are adjusted for the variability across sample splits using the median method; see Chernozhukov et al. (2018a).

We use two hybrid ML methods in our analysis. Ensemble is a weighted average of estimates from lasso, boosting, random forest and neural networks, the weights being chosen to give the lowest average mean squared out-of-sample prediction error. Best chooses the best method for estimating the nuisance functions in terms of the average out-of-sample prediction performance among all the other methods.

The lasso estimates are based on ℓ1-penalized regressions with the penalty parameter obtained through 10-fold cross-validation. As controls, for the lasso we consider the set of all raw covariates as well as first-order interactions. For the rest of the ML methods, we consider the set of raw covariates as controls. The regression tree method fits a CART (classification and regression tree) tree with a penalty parameter (which restricts the tree from overfitting and ensures that only splits that are considered "worthy" are implemented) obtained with 10-fold cross-validation. The random forest estimates are obtained using 1000 trees, while the boosting estimates are obtained with 1000 boosted regression trees. For the boosting, the minimum number of observations in the trees' terminal nodes is set to 1 and the bag.fraction parameter is set to 0.5, except for Panel D of Table 1, where it is increased to 0.8. For the neural networks we used 2 neurons and a decay parameter of 0.01; the activation function is set to the linear function.

For the analysis of nonlinear terms with lasso, we examine the estimated nuisance functions for the outcome average entry rate and the treatment variable first-year effective tax rate.
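As an illustration of the cross-fitting and median-aggregation logic described above, the sketch below implements a minimal partialling-out DML estimator on simulated data. The random-forest learner, the number of splits, and all names are our own assumptions for the example, not the settings used in the paper.

```python
# Minimal illustrative DML (partialling-out) sketch with 2-fold
# cross-fitting and a median over sample splits; simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, k, theta = 1000, 10, 1.0
X = rng.normal(size=(n, k))
d = X[:, 0] + rng.normal(size=n)           # treatment depends on a confounder
y = theta * d + X[:, 0] + rng.normal(size=n)

def dml_plr(y, d, X, seed):
    """One sample split: cross-fit the nuisances, then residual-on-residual OLS."""
    res_y, res_d = np.empty(len(y)), np.empty(len(d))
    for train, test in KFold(2, shuffle=True, random_state=seed).split(X):
        g = RandomForestRegressor(random_state=0).fit(X[train], y[train])
        m = RandomForestRegressor(random_state=0).fit(X[train], d[train])
        res_y[test] = y[test] - g.predict(X[test])
        res_d[test] = d[test] - m.predict(X[test])
    return (res_d @ res_y) / (res_d @ res_d)

# Median over sample splits (5 here for speed; 100 in the paper).
theta_hat = np.median([dml_plr(y, d, X, s) for s in range(5)])
```

The per-split estimator regresses the outcome residual on the treatment residual, which is what makes the estimate insensitive to small errors in the two ML nuisance fits.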
In general, the activation function can be set to the linear function for regression problems (when the outcome is continuous) and to the logistic function for classification problems (when the outcome is categorical). For the lasso estimation, depending on the application, other nonlinear terms could be added, such as the squares of the covariates, or three-way interactions.

In our analysis, for the estimation of the two nuisance functions, the lasso selects among the simple covariates and their two-way interactions. It is interesting to note that a large number of interaction terms is selected both in the treatment nuisance function m̂(·) and in the outcome nuisance function ĝ(·). The lasso coefficients are calculated as the median coefficients across splits. Among these, some appear in both nuisance functions (the coefficients of the common terms are depicted in purple in Figure B.1). A particular issue that appears with the lasso when the interest is on analyzing the interaction terms is worth mentioning here. Since the lasso implements regularization by shrinking the smallest coefficients to zero, it is possible that interaction terms are included in the regression while the coefficients of the raw covariates forming the interactions are shrunk to zero. It is thus important to check whether the raw covariates forming these interactions also appear in the regression. If the coefficients on the raw covariates are shrunk, the coefficients of the 'pure' interaction terms might not be properly captured, and the found interaction terms might actually reflect the effect of the raw covariates, diminishing the importance of our uncovered nonlinearities. Thus, when analyzing the relevance of the interaction terms, we are careful to only report the coefficients of the interactions for which both main effects are included in the lasso estimation. The lasso coefficients of all the raw covariates are reported in Table B.1 of the Appendix.

It is important to note here that we do not make inference using the lasso coefficients, but just analyze the magnitude of the coefficients, as a measure of the covariates' importance for predicting the outcome and the treatment variables.
A.2 The Effect of Plough Agriculture on Gender Roles
Details on the Original Analysis.
Alesina et al. (2013) consider several empirical strategies and data sets. They start with OLS regressions performed using country-level and micro-level data. Then, to tackle possible endogeneity issues, the paper follows two approaches: first, several potential confounders are included in the regressions; second, an instrumental variable strategy is used. Our focus is on the country-level regressions.

The instrumental variable strategy is summarized in Section 2.2.2.

The baseline OLS country-level results in the original analysis (reported in Table 4 of Alesina et al., 2013) are obtained by estimating the following equation:

    y_c = α + β plough use_c + X_c^H Γ + X_c^C Π + ε_c,

where c stands for country. In the paper, three outcome variables are examined as measures of gender roles: female labour force participation, attitudes about women's work, and attitudes about women as leaders. The first outcome variable is an indicator variable that equals one if the individual is in the labor force in 2000; the second is the share of firms with a woman among its principal owners in the period 2003-2010; finally, the third is the proportion of seats held by women in the national parliament in 2000. The treatment variable, plough use_c, is calculated as the estimated proportion of individuals living in a country with ancestors that used the plough in pre-industrial agriculture. The vector X_c^H includes historical ethnographic variables at the country level. These controls capture the historical characteristics of ethnicities living in a country, and they are meant to account for differences between ethnicities that historically adopted the plough and those that did not.
They include: ancestral suitability for agriculture, fraction of ancestral land that was tropical or subtropical, ancestral domestication of large animals, ancestral settlement patterns, and ancestral political complexity. The vector X_c^C denotes contemporary country-level controls: the natural log of real per capita GDP, and its square. These are included as the level of economic development is believed to have an impact on female labour force participation, and the square of per capita GDP is intended to capture the observed U-shaped relation between the two variables. Continent fixed effects are also added in some specifications.

The extended set of controls includes additional historical and contemporary controls. Just as with the baseline controls, the additional historical controls are measures of the characteristics of the ancestors of the current population living in a country. These are: the intensity of agriculture; the proportion of subsistence provided by hunting and by the herding of large animals; the fraction of countries' ancestors without land inheritance rules, with patrilocal post-marital residence rules, and with matrilocal post-marital residence rules; the fraction of countries' ancestors with a nuclear and an extended family structure; and the average year the ethnicities were sampled in the Ethnographic Atlas. The contemporary controls are: years of civil and interstate conflicts (1816-2007); terrain ruggedness; whether a country was under a communist regime after WWII; the fraction of a country's population with European descent; oil production per capita; agricultural, manufacturing and services shares of GDP; and the fraction of a country's population who is Catholic, Protestant, other Christian, Muslim, and Hindu. Alesina et al. (2013) provide the rationale for including each of these controls, and details on how the variables are constructed.

The geo-climatic characteristics included in the IV analysis are: terrain slope, soil depth, average temperature, and average precipitation. In the original paper, the geo-climatic characteristics are added linearly, in quadratic forms, and as linear interactions.
Details on the DML Analysis.
As in the previous example, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across the splits, and standard errors adjusted for the variability across sample splits using the median method. The values of the tuning parameters are the same as in the first example.
A.3 The Effect of Skill-Biased Tariffs on Growth
Details on the Original Analysis.
For the country-level results, Nunn and Trefler (2010) estimate the following regression equation:

    ln(y_c,1 / y_c,0) = α + β_SB SBτ_c + X_c β_X + ε_c,

where ln(y_c,1 / y_c,0) is the log annual per capita GDP growth in country c between the beginning and the end of the time period considered, SBτ_c is a measure of the initial skill bias of tariffs, and X_c represents the controls. Three measures of the skill bias of tariffs are used: the first is the correlation between the industry tariffs and the industry's skill intensity, while the second and third are based on the difference between the log average tariffs in skill-intensive industries and the log average tariffs in unskilled-intensive industries (the two measures differ in the choice of the cut-off value for industry skill intensity, with the second using a lower cut-off than the third). The controls include: the log of the initial average level of tariffs in the country, three country characteristics measured at the initial period (the log of GDP per capita, the log of human capital, and the log of the ratio of investment to GDP), cohort fixed effects (to account for the fact that countries have different initial time periods), region fixed effects (accounting for 10 different regions), and two measures of initial production structure (the log of output in skill-intensive and in unskilled-intensive industries separately).

Additionally, Nunn and Trefler (2010) estimate the following regression equation, using industry-level data:

    ln(q_ic,1 / q_ic,0) = β_q ln q_ic,0 + β_τ ln τ_ic,0 + β_E ln τ̄_c + β_SB SBτ_c + X_c β_X + α_i + ε_ic,

where ln(q_ic,1 / q_ic,0) is the average annual log change in industry output in industry i and country c; ln q_ic,0 is the log of industry output in the initial period; ln τ_ic,0 is the log initial-period tariff; ln τ̄_c is the log average tariff; SBτ_c is one of the three measures of the skill bias of tariffs; and α_i are industry fixed effects.
The variable X_c indicates the controls, which are the same as in the country-level regressions.

The original results show a strong, positive correlation between skill-biased tariffs and long-term per capita income growth at the country level (Table 4 in Nunn and Trefler, 2010). The correlation is strong also between the skill bias of tariffs and industry output growth, with and without including the initial industry tariff in the regression (Tables 5 and 6 in Nunn and Trefler, 2010, respectively). The fact that the size of the coefficient of skill-biased tariffs remains large when adding the variable initial industry tariffs suggests that the mechanism highlighted in the model, i.e. skill-biased tariffs shifting resources towards skill-intensive industries, cannot fully account for the correlation between the treatment variable and long-term growth. Nunn and Trefler (2010) further show, with country-level regressions, that the model mechanism can explain up to one quarter of the total correlation between the skill bias of tariffs and long-term growth (Table 7 in the original paper). The paper then investigates other alternative mechanisms that can explain the independent effect of skill-biased tariffs on output growth in Sections V, VI and VII of the original paper.

Details on the DML Analysis.
As in the previous examples, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across splits, and standard errors are adjusted for the variability across sample splits using the median method. The tuning choices are the same as in the previous two examples, except for the neural network in the country-level regressions, where the estimates are obtained using 3 neurons and a decay parameter of 0.001.
A.4 The Effect of Fox News on the Republican Vote Share
Details on the Original Analysis.
To produce the main results (see Table IV in the original paper), the authors estimate the following regression:

v_R,k,j,2000^Pres − v_R,k,j,1996^Pres = β d_k,2000^FOX + Γ X_k,2000 + Γ_− X_k,2000−1990 + Γ_c C_k,2000 + θ_j + ε_k,j,

where k denotes a town in congressional district j. The dependent variable is the change in the Republican vote share between the 1996 and the 2000 presidential elections. The treatment variable d_k,2000^FOX is an indicator variable taking the value of 1 for towns where Fox News was available by the year 2000, and 0 otherwise. The regression includes demographic controls at the town level: total population, the employment rate, the shares of African Americans and of Hispanics, the share of males, the share of the population with some college education, the share of college graduates, the share of high school graduates, the share of the town that is urban, the marriage rate, the unemployment rate, and average income. These controls are added both as levels in 2000 (X_k,2000) and as changes between 1990 and 2000 (X_k,2000−1990), and aim at capturing possible confounders that could be correlated with both the availability of Fox News and voting. In addition to the demographic controls, the regression includes a set of cable system features, denoted by C_k,2000, which are potentially correlated with the treatment variable. These are deciles in the number of channels provided and in the number of potential subscribers. Finally, fixed effects (congressional district fixed effects or county fixed effects), denoted by θ_j, are added to capture trends in voting that might be common to a geographical area and also correlated with Fox News availability. In the original analysis, standard errors are clustered at the cable company level. The paper also tests whether Fox News increased voter turnout and the Republican vote share in the Senate election.

The results from the heterogeneity analysis of DellaVigna and Kaplan (2007) show a negative but insignificant effect for swing districts. Additionally, the authors find that the effect of Fox News on the Republican vote share is significantly smaller in towns where the number of cable channels is higher, suggesting a negative impact of higher competition on the effect of Fox News. Moreover, the effect is found to be significantly larger in more urban areas and smaller in more Republican districts. Regarding the latter two findings, the authors point out that in rural areas and in Republican districts the Republican party tends to have a larger vote base to begin with, thus diminishing the share of voters that could potentially be convinced by Fox News. Out of the four effects, only the differential effect for urban population is significant in both main specifications (county and district fixed effects). The interaction of the treatment variable with the Republican district variable is only significant when including county fixed effects, but not when including district fixed effects, and the opposite is true for the interaction of the treatment with the number of cable channels. The authors also note that they find a smaller effect in the South, but this result is not reported in their paper, and we do not focus on it in our analysis.

Details on the Analysis with the Causal Random Forest.
There are a number of parameters to be set in the causal random forest algorithm, such as the number of trees, the size of the subsample, and the minimum number of control and treatment units in each leaf. The number of trees is typically chosen as a trade-off between computation time and the test error rate. A larger number of trees reduces the Monte Carlo error due to subsampling, which means that the treatment effect predictions will vary less across different forests. A higher minimum number of treatment and control units will lead to bigger leaves and a less deep tree, which will predict less heterogeneity. A smaller minimum will increase the variance, as the treatment effect will be estimated with too few observations in a given leaf. Setting a smaller subsample size will decrease the dependence across trees, but will increase the variance of each estimate in a tree. The sizes of the training and estimation samples are typically fixed to 50% of the drawn subsample. If there are reasons to allocate more observations to one or the other sample, these proportions can be changed. In the algorithm, there is also a standard parameter for the number of covariates considered for a split, chosen before building each tree within the forest. This is often set to ⌊K/3⌋ in the literature, where K is the total number of covariates.

In our analysis, the tuning parameter values are optimized via cross-validation, except for the number of trees, which is set to 2000. We performed sensitivity analysis with different values for the number of trees (1000 and 5000); the results are available upon request.

In the cluster-robust causal forest (Athey and Wager, 2019), when constructing the subsample on which the forest is trained, we do not directly draw observations, but clusters.
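The cluster-level subsampling step can be sketched as follows. This is a toy illustration (the sample sizes and variable names are ours, not the paper's): clusters, rather than individual towns, are drawn into the training subsample, and an observation is flagged out-of-bag when its entire cluster was left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 towns grouped into 40 cable-company clusters (sizes are ours).
n_obs, n_clusters = 200, 40
cluster_id = rng.integers(0, n_clusters, size=n_obs)

# Draw half of the clusters without replacement, then keep every
# observation that belongs to a drawn cluster.
drawn = rng.choice(n_clusters, size=n_clusters // 2, replace=False)
in_subsample = np.isin(cluster_id, drawn)

# An observation is out-of-bag for this tree if and only if its whole
# cluster was left out, so its prediction never uses own-cluster data.
out_of_bag = ~in_subsample
```

Because membership is decided at the cluster level, all observations within a cluster share the same in-sample/out-of-bag status.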
In addition, in the final step, when constructing the predicted out-of-bag treatment effects, an observation is considered out-of-bag if its cluster was not drawn in the subsample.

The variable importance measure reported in Tables B.7 and B.8 takes into account the proportion of splits over all trees for a particular variable, weighted by depth. It is useful for describing which covariates most influence the final estimates when employing the causal forest, as the interpretability of a single tree is lost in this case. Recall from the main text that in the causal forest splits are performed if they maximize a criterion function that rewards splits that increase the variance of the treatment effect across leaves, while penalizing splits that increase the variance within a leaf. Hence, higher values for this measure indicate higher importance in terms of heterogeneity of treatment effects.

This makes random forests different from bagged trees. In bagged trees the number of predictors considered for a split is equal to the total number of covariates the researcher considers, while in random forests the number of predictors is strictly less than this total number. The procedure 'decorrelates' the trees (as the trees will be less similar) and the aggregation of predictions across trees will have a lower variance.

A.5 The Effect of Teacher Training on Student Performance

Details on the Original Analysis.
In Loyalka et al. (2019), the main results are obtained by estimating the following regression equation:

Y_i,j = α_0 + α_1 D_j + X_ij α_2 + τ_k + ε_i,j,

where Y_i,j is the outcome, measured at midline or endline, for student i in school j; D_j is a dummy variable indicating the treatment assignment; the vector X_ij includes the control variables, measured at baseline; and τ_k indicates the block fixed effects. The main outcome of interest, student achievement, is measured with a 35-minute mathematics test at endline. The full set of control variables includes student characteristics (age, gender, parent educational attainment, household wealth), class size, and teacher characteristics (gender, age, experience, education level, rank, a teacher certification dummy, and a dummy indicating whether the teacher majored in math).

The findings from the heterogeneity analysis suggest that the program has a small positive effect on the achievement of students taught by less qualified teachers and a negative effect on students whose teachers are more qualified. In addition, some evidence of heterogeneity is found in terms of whether or not the teachers majored in math, with a negative effect on achievement for those students whose teachers did major in math (this effect is only found when comparing the PD plus follow-up with the control group).
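As a concrete illustration of this kind of specification, the following toy sketch (simulated data; all variable names and coefficient values are ours, not from the study) estimates the treatment coefficient α_1 by OLS with block dummies absorbing the block fixed effects:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data in the spirit of Y_ij = a0 + a1*D_j + X_ij*a2 + block FE + noise.
n, n_blocks = 600, 6
block = rng.integers(0, n_blocks, size=n)
D = rng.integers(0, 2, size=n).astype(float)   # treatment assignment dummy
X = rng.normal(size=n)                         # one baseline control
Y = 0.5 * D + 0.3 * X + 0.1 * block + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, treatment, control, and block dummies
# (first block omitted to avoid collinearity with the intercept).
dummies = (block[:, None] == np.arange(1, n_blocks)).astype(float)
Z = np.column_stack([np.ones(n), D, X, dummies])

coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
alpha1 = coef[1]   # OLS estimate of the treatment effect (true value 0.5 here)
```

Dropping the first block dummy and keeping the intercept is the usual way to avoid the dummy-variable trap.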
Details on the Generic ML Analysis.
In addition to the full set of controls included in the original paper, we add to our analysis the following variables: the baseline values of a number of student-level variables (math self-concept, math anxiety, intrinsic motivation for math, instrumental motivation for math, time spent each week studying math), plus a number of variables indicating teachers' behaviour in the classroom, evaluated by students at baseline (instructional practices of teacher, teacher care, classroom management, and teacher communication).

The schools were randomized within blocks. A block is defined by the year of study the student is enrolled in (i.e. grades 7, 8, or 9) and by the two agencies that implemented the intervention. Hence, the total number of blocks is six.

The p-values are computed based on the median of many random conditional p-values, with the nominal level adjusted for splitting uncertainty. The Best BLP and Best GATES measures are based on maximizing the correlation between the proxy predictor of the conditional average treatment effect, S(Z), and the true conditional treatment effect, s(Z) (see Chernozhukov et al., 2018b). Table B.9 shows that this correlation is the largest for the neural network. Therefore, we carry out the HTE analysis using the neural network. The values of the tuning parameters were optimized via cross-validation for the elastic net and the neural network. For the random forest they are set to default values to save on computation time. For the random forest, the number of trees is set to 2000 and the number of covariates considered for a split is set to K/3, which gives a value of 8.
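The splitting-uncertainty adjustment mentioned above can be sketched in a few lines (the function name is ours): in the generic ML approach of Chernozhukov et al. (2018b), the reported p-value is the median of the split-specific p-values, doubled (equivalently, the nominal level is halved) to account for the randomness of the sample splits.

```python
import numpy as np

def adjusted_p_value(p_values_across_splits):
    """Median of the split-specific p-values, doubled to account for the
    randomness induced by sample splitting (capped at 1)."""
    return min(1.0, 2.0 * float(np.median(p_values_across_splits)))

# Example with p-values from five hypothetical splits:
p_adj = adjusted_p_value([0.01, 0.03, 0.02, 0.08, 0.04])  # median 0.03, doubled
```

The cap at 1 only binds when the median p-value exceeds 0.5.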
Additional Tables and Figures
Table B.1: The Effect of Corporate Taxes on Entrepreneurship: Lasso Coefficients of Raw Covariates

                                              (1)                 (2)
Outcome / Treatment variable:          Average Entry Rate  First-year Effective Tax Rate
Log of number of tax payments              -0.402               0.017
Procedures to start a business             -0.006               0.001
Seigniorage 2004                           -0.003               0
Other taxes                                -0.001               0.003
Rigidity of employment                      0                   0
Average inflation (1995-2004)               0                   0
PIT top marginal rate                       0                   0.010
IEF Property Right Index                    0                   0.001
VAT and sales tax                           0                  -0.005
Tax evasion (GCR)                           0.009              -0.004
Log GDP pc 2003                             0.011              -0.020
EFW Freedom to Trade Internationally Index  0.305              -0.619
Notes:
The table shows the lasso coefficients of the raw covariates, obtained by estimating the nuisance functions g(·) (column 1) and m(·) (column 2). The lasso coefficients are calculated as the median over splits.

Table B.2: On the origins of gender roles: Country-level estimates, partially linear model
Columns: (1) Lasso, (2) Reg. Tree, (3) Boosting, (4) Forest, (5) Neural Net., (6) Ensemble, (7) Best, (8) OLS.
Panels: A. Female labour force participation; B. Share of firms with female ownership; C. Share of political positions held by women. Each panel reports the coefficient on traditional plough use, its standard error, and the number of observations; a final row reports the number of raw covariates. [Numeric entries were lost in extraction.]
Notes: Analysis of Alesina et al. (2013) using DML. Column (8) reports the original paper results. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Robust standard errors are reported in column (8). The number of covariates does not include the treatment variable.

Table B.3: The Structure of Tariffs and Long-Term Growth: Industry-level estimates
Columns: (1) Lasso, (2) Reg. Tree, (3) Boosting, (4) Forest, (5) Neural Net., (6) Ensemble, (7) Best, (8) OLS.
Panels: A. Skill tariff correlation; B. Tariff differential (low cut-off); C. Tariff differential (high cut-off). Each panel reports the coefficient on the corresponding skill-bias measure and its standard error; final rows report the number of observations and of raw covariates. [Numeric entries were lost in extraction.]
Notes: Analysis of Nunn and Trefler (2010) using DML. Column (8) reports the original paper estimates. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Standard errors adjusted for clustering at the country level are reported in column (8). The number of covariates does not include the treatment variable.

Table B.4: The Structure of Tariffs and Long-Term Growth: Industry-level estimates
Columns, panels, and notes as in Table B.3. [Numeric entries were lost in extraction.]

(1) CATE below median  (2) CATE above median  (3) p-value for the difference
Population, diff. btw. 2000 and 1990 0.00413 (0.00242) 0.00806 (0.00189) 0.20027
Share with high school degree, diff. btw. 2000 and 1990 0.0086 (0.00199) 0.0029 (0.0027) 0.08938
Share with some college, diff. btw. 2000 and 1990 0.00736 (0.00207) 0.0039 (0.00227) 0.26069
Share with college degree, diff. btw. 2000 and 1990 0.00757 (0.00272) 0.00582 (0.00191) 0.59872
Share male, diff. btw. 2000 and 1990 0.00949 (0.00222) 0.0035 (0.00231) 0.06126
Share African American, diff. btw. 2000 and 1990 0.00629 (0.00243) 0.00666 (0.002) 0.90674
Share Hispanic, diff. btw. 2000 and 1990 0.00428 (0.00238) 0.00737 (0.00208) 0.32866
Unemployment rate, diff. btw. 2000 and 1990 0.00366 (0.00238) 0.00866 (0.00224) 0.12612
Married, diff. btw. 2000 and 1990 0.00698 (0.00202) 0.00562 (0.00257) 0.67592
Median income, diff. btw. 2000 and 1990 0.00628 (0.00224) 0.00653 (0.0023) 0.93661
Share urban, diff. btw. 2000 and 1990 0.00517 (0.00203) 0.00945 (0.0025) 0.18368
Population 2000 0.00492 (0.00252) 0.00662 (0.00164) 0.57185
Share with some college 2000 0.00328 (0.00204) 0.00964 (0.00249) 0.04809
Share with college degree 2000 0.00556 (0.00253) 0.00679 (0.00185) 0.6946
Share male 2000 0.0055 (0.00194) 0.00976 (0.00277) 0.20794
Share African American 2000 0.0025 (0.00271) 0.00739 (0.00172) 0.12759
Share Hispanic 2000 0.00136 (0.00225) 0.00799 (0.00217) 0.03386
Employment rate 2000 0.00557 (0.00232) 0.00771 (0.00215) 0.50069
Unemployment rate 2000 0.00541 (0.00214) 0.00741 (0.00235) 0.52906
Share married 2000 0.00683 (0.00228) 0.00585 (0.00229) 0.76121
Median income 2000 0.00501 (0.00218) 0.00712 (0.00223) 0.50006
Share urban 2000 0.00441 (0.0024) 0.00673 (0.0019) 0.44815
No. potential cable subscribers 2000 0.00818 (0.00238) 0.00594 (0.00169) 0.44436
Decile 1 in no. potential cable subscribers 0.00661 (0.0016) -0.00787 (0.01626) 0.37539
Decile 2 in no. potential cable subscribers 0.00664 (0.00165) 0.00084 (0.00861) 0.50799
Decile 3 in no. potential cable subscribers 0.00612 (0.00151) 0.0171 (0.0065) 0.0999
Decile 4 in no. potential cable subscribers 0.00634 (0.0017) 0.0077 (0.00393) 0.75084
Decile 5 in no. potential cable subscribers 0.00667 (0.00174) 0.00357 (0.00371) 0.44915
Decile 6 in no. potential cable subscribers 0.00669 (0.00171) 0.00471 (0.00463) 0.68762
Decile 7 in no. potential cable subscribers 0.00668 (0.0017) 0.00531 (0.00492) 0.79269
Decile 8 in no. potential cable subscribers 0.00758 (0.00168) -0.00131 (0.00405) 0.04239
Decile 9 in no. potential cable subscribers 0.0071 (0.00167) 0.00226 (0.00317) 0.17685
Decile 10 in no. potential cable subscribers 0.0045 (0.00207) 0.01139 (0.00188) 0.01393
No. cable channels available 2000 0.00816 (0.00684) 0.0065 (0.00148) 0.812
Decile 1 in no. cable channels available 0.00645 (0.0017) 0.00149 (0.02648) 0.85167
Decile 2 in no. cable channels available 0.00655 (0.00153) 0.01884 (0.0383) 0.74845
Decile 3 in no. cable channels available 0.00657 (0.00161) 0.00758 (0.01372) 0.94203
Decile 4 in no. cable channels available 0.00747 (0.00155) -0.0101 (0.01332) 0.18996
Decile 5 in no. cable channels available 0.00553 (0.00158) 0.02402 (0.01216) 0.13149
Decile 6 in no. cable channels available 0.00569 (0.00164) 0.01323 (0.00648) 0.25923
Decile 7 in no. cable channels available 0.00585 (0.00192) 0.00953 (0.00253) 0.24565
Decile 8 in no. cable channels available 0.0068 (0.00178) 0.00355 (0.00398) 0.45524
Decile 9 in no. cable channels available 0.00576 (0.00181) 0.01239 (0.00288) 0.05169
Swing district 0.00685 (0.00201) 0.00602 (0.00272) 0.80599
Republican district 0.00693 (0.00187) 0.00084 (0.00264) 0.06021
Notes:
The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the causal random forest that includes district dummy variables. As we are not interested in exploring heterogeneity along the congressional districts, the HTE results for district dummy variables are omitted from the table.

(1) CATE below median  (2) CATE above median  (3) p-value for the difference
Population, diff. btw. 2000 and 1990 0.00357 (0.00398) 0.00829 (0.00299) 0.34201
Share with high school degree, diff. btw. 2000 and 1990 0.0088 (0.00311) 0.00225 (0.00323) 0.14407
Share with some college, diff. btw. 2000 and 1990 0.00809 (0.00281) 0.00194 (0.00431) 0.23202
Share with college degree, diff. btw. 2000 and 1990 0.00709 (0.00317) 0.00604 (0.00329) 0.8194
Share male, diff. btw. 2000 and 1990 0.00975 (0.00356) 0.00308 (0.00268) 0.13407
Share African American, diff. btw. 2000 and 1990 0.00547 (0.00346) 0.007 (0.00298) 0.7364
Share Hispanic, diff. btw. 2000 and 1990 0.00369 (0.00383) 0.00755 (0.00286) 0.41946
Unemployment rate, diff. btw. 2000 and 1990 0.00328 (0.00304) 0.00872 (0.00308) 0.20834
Married, diff. btw. 2000 and 1990 0.00622 (0.00327) 0.00639 (0.00339) 0.97002
Median income, diff. btw. 2000 and 1990 0.0065 (0.00354) 0.00609 (0.00282) 0.92735
Share urban, diff. btw. 2000 and 1990 0.00527 (0.00273) 0.00881 (0.00372) 0.44257
Population 2000 0.00577 (0.00398) 0.00636 (0.0027) 0.9022
Share with some college 2000 0.00532 (0.00321) 0.00785 (0.00376) 0.60916
Share with college degree 2000 0.00545 (0.00296) 0.00672 (0.00318) 0.76975
Share male 2000 0.00459 (0.00259) 0.01138 (0.00529) 0.24942
Share African American 2000 0.00198 (0.00518) 0.00731 (0.00265) 0.35943
Share Hispanic 2000 0.00071 (0.00378) 0.00825 (0.00314) 0.1245
Employment rate 2000 0.0043 (0.00293) 0.00892 (0.00416) 0.36452
Unemployment rate 2000 0.00539 (0.0027) 0.00728 (0.0035) 0.66907
Share married 2000 0.00684 (0.00278) 0.00561 (0.00355) 0.78466
Median income 2000 0.00546 (0.00381) 0.00648 (0.00272) 0.82677
Share urban 2000 0.00534 (0.00404) 0.00647 (0.00276) 0.81683
No. potential cable subscribers 2000 0.00744 (0.00616) 0.00587 (0.00285) 0.81685
Decile 1 in no. potential cable subscribers 0.00653 (0.00264) -0.00486 (0.0162) 0.48767
Decile 2 in no. potential cable subscribers 0.00655 (0.00268) 0.00209 (0.0116) 0.70797
Decile 3 in no. potential cable subscribers 0.00594 (0.00258) 0.01893 (0.0111) 0.25437
Decile 4 in no. potential cable subscribers 0.00628 (0.00256) 0.00734 (0.00928) 0.91234
Decile 5 in no. potential cable subscribers 0.00677 (0.00273) 0.00051 (0.00724) 0.4189
Decile 6 in no. potential cable subscribers 0.00691 (0.00278) 0.00113 (0.00592) 0.37691
Decile 7 in no. potential cable subscribers 0.00685 (0.00241) 0.00351 (0.01161) 0.77827
Decile 8 in no. potential cable subscribers 0.00741 (0.00283) -0.00051 (0.004) 0.10608
Decile 9 in no. potential cable subscribers 0.00683 (0.00294) 0.00274 (0.00409) 0.41683
Decile 10 in no. potential cable subscribers 0.004 (0.00384) 0.01147 (0.00369) 0.16066
No. cable channels available 2000 0.00562 (0.00773) 0.00678 (0.00255) 0.88643
Decile 1 in no. cable channels available 0.00646 (0.00274) -0.00726 (0.03324) 0.68079
Decile 2 in no. cable channels available 0.00644 (0.00265) 0.01768 (0.01293) 0.39453
Decile 3 in no. cable channels available 0.00672 (0.00267) 0.00203 (0.01179) 0.69811
Decile 4 in no. cable channels available 0.00816 (0.00263) -0.01645 (0.01506) 0.10755
Decile 5 in no. cable channels available 0.00484 (0.00251) 0.02998 (0.01806) 0.16774
Decile 6 in no. cable channels available 0.00537 (0.00289) 0.0133 (0.0045) 0.13846
Decile 7 in no. cable channels available 0.00602 (0.00304) 0.00867 (0.0042) 0.60869
Decile 8 in no. cable channels available 0.00675 (0.00291) 0.00327 (0.00503) 0.54915
Decile 9 in no. cable channels available 0.00543 (0.0027) 0.0139 (0.00631) 0.21689
Swing district 0.00634 (0.00308) 0.00736 (0.00515) 0.86489
Republican district 0.00665 (0.00286) 0.00079 (0.00604) 0.38064

Notes:
The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the cluster-robust causal forest.

Variable  Importance (%)
No. cable channels available 2000 6.52
No. potential cable subscribers 2000 5.23
Share employed, diff. btw. 2000 and 1990 4.9
Share African American 2000 4.74
Share married 2000 4.39
Unemployment rate, diff. btw. 2000 and 1990 4.22
Decile 10 in no. cable channels 2000 4.16
Employment rate 2000 3.67
Share with high school degree, diff. btw. 2000 and 1990 3.56
Share with some college 2000 3.55
Population, diff. btw. 2000 and 1990 3.41
Share male, diff. btw. 2000 and 1990 3.38
Share Hispanic, diff. btw. 2000 and 1990 3.27
Median income, diff. btw. 2000 and 1990 3.25
Median income 2000 3.22
Share Hispanic 2000 3.21
Share married, diff. btw. 2000 and 1990 3.07
Share African American, diff. btw. 2000 and 1990 3.02
Population 2000 3.01
Employment rate 2000 2.72
Share with some college, diff. btw. 2000 and 1990 2.67
Share male 2000 2.55
Share with college degree 2000 2.49
Share with college degree, diff. btw. 2000 and 1990 2.23
Share with high school 2000 2.14
Decile 10 in no. potential cable subscribers 2.08
Share urban population, diff. btw. 2000 and 1990 1.9
Decile 7 in no. cable channels available 1.75
Decile 9 in no. cable channels available 1.54
Share of urban population 2000 1.4
Republican district 0.78
Decile 8 in no. cable channels available 0.75
Swing district 0.74
Decile 9 in no. potential cable subscribers 0.36
Decile 8 in no. potential cable subscribers 0.07
Decile 7 in no. potential cable subscribers 0.02
Decile 6 in no. cable channels available 0.01
Notes:
The table reports the importance of each variable obtained from the causal forest with district dummies. Variables with importance lower than 0.01% are omitted.
Variable  Importance (%)
No. cable channels available 2000 10.4
No. potential cable subscribers 2000 8.22
Share with some college 2000 4.79
Unemployment rate, diff. btw. 2000 and 1990 4.35
Decile 9 in no. cable channels 4.16
Decile 10 in no. cable channels 4.09
Employment rate, diff. btw. 2000 and 1990 3.86
Share African American 2000 3.59
Median income 2000 3.39
Population, diff. btw. 2000 and 1990 3.31
Median income, diff. btw. 2000 and 1990 3.1
Share married 2000 2.9
Share male, diff. btw. 2000 and 1990 2.88
Decile 7 in no. cable channels 2.84
Unemployment rate 2000 2.56
Share African American, diff. btw. 2000 and 1990 2.55
Share Hispanic, diff. btw. 2000 and 1990 2.5
Share married, diff. btw. 2000 and 1990 2.4
Share Hispanic 2000 2.38
Share with high school degree, diff. btw. 2000 and 1990 2.28
Share urban, diff. btw. 2000 and 1990 2.23
Share male 2000 2.2
Decile 10 in no. potential cable subscribers 2.19
Population 2000 2.14
Share with some college, diff. btw. 2000 and 1990 2.08
Share with college degree 2000 1.98
Employment rate 2000 1.79
Share with college degree, diff. btw. 2000 and 1990 1.58
Share with high school degree 2000 1.57
Share urban 2000 1.55
Republican district 1.37
Swing district 0.98
Decile 8 in no. cable channels 0.97
Decile 9 in no. potential cable subscribers 0.51
Decile 8 in no. potential cable subscribers 0.21
Decile 7 in no. potential cable subscribers 0.06
Decile 6 in no. cable channels 0.03
Decile 6 in no. potential cable subscribers 0.02
Notes:
The table reports the importance of each variable obtained from the cluster-robust causal forest. Variables with importance lower than 0.01% are omitted.

            (1) Elastic net  (2) Neural network  (3) Random forest
Best BLP         0.012            0.014               0.011
Best GATES       0.115            0.121               0.099
Notes:
The table compares the performance of the three ML methods used to produce the proxy predictors. The performance measures Best BLP and Best GATES are computed as medians over 100 splits.
Table B.10: Teacher Training - Generic Method: GATES of most and least affected groups

                                                  (1) 20% most affected  (2) 20% least affected  (3) Difference
Effect of teacher training on student achievement      0.164                  -0.179                 0.365
90% Confidence Interval                               (0.048, 0.279)         (-0.301, -0.058)       (0.198, 0.533)
p-value                                                0.011                   0.092                 0.001

Notes:
The estimates are obtained using the neural network to produce the proxy predictor S(Z). The values reported correspond to the medians over 100 splits.

(1) 20% most affected  (2) 20% least affected  (3) p-value for the difference
Baseline instructional practices of teacher 0.211 0.053 0.000
(0.149, 0.272) (-0.009, 0.114)
Teacher's baseline classroom management 0.065 0.074 0.000
(0.003, 0.127) (0.012, 0.136)
Teacher gender 0.529 0.536 1.000
(0.497, 0.562) (0.504, 0.568)
Teacher certification dummy 0.970 1.000 0.000
(0.962, 0.977) (0.992, 1.008)
Student's baseline instrumental motivation for math -0.111 0.224 0.000
(-0.173, -0.050) (0.162, 0.285)
Student's baseline time spent each week studying math -0.073 0.204 0.000
(-0.142, -0.004) (0.135, 0.273)
Student's baseline math self-concept -0.380 0.317 0.000
(-0.441, -0.319) (0.256, 0.378)
Teacher majored in math 0.309 0.507 0.000
(0.278, 0.340) (0.475, 0.538)
Mother education level 0.570 0.425 0.000
(0.538, 0.602) (0.393, 0.457)
Teacher's baseline communication -0.088 0.251 0.000
(-0.152, -0.025) (0.187, 0.314)
Student's baseline intrinsic motivation for math -0.294 0.299 0.000
(-0.356, -0.232) (0.237, 0.362)
Baseline teacher care -0.165 0.311 0.000
(-0.228, -0.103) (0.249, 0.374)
Household asset index -0.421 0.240 0.000
(-0.491, -0.351) (0.170, 0.310)
Father education level 0.583 0.589 1.000
(0.551, 0.614) (0.557, 0.621)

Notes:
The table shows the average value of the teacher and student characteristics for the most and least affected groups. The estimates are obtained using the neural network to produce the proxy predictor S(Z). Confidence intervals with 90% nominal level are reported in parentheses. All variables, except Teacher gender, Teacher certification dummy, Teacher majored in math, Mother education level and Father education level, are normalized. The values reported correspond to the medians over 100 splits.

Variable  Correlation with S(Z)
Teacher college degree -0.237
Teacher training hours 0.125
Teacher ranking 0.116
Student age 0.111
Teacher experience (years) 0.101
Student female -0.094
Teacher age 0.089
Math score at baseline (normalized) 0.075
Student baseline math anxiety 0.063
Class size -0.060
Baseline instructional practices of teacher 0.053
Teacher's baseline classroom management 0.051
Teacher gender -0.045
Teacher certification dummy 0.036
Student's baseline instrumental motivation for math 0.025
Student's baseline time spent each week studying math 0.022
Student's baseline math self-concept -0.021
Teacher majored in math -0.016
Mother education level 0.009
Teacher's baseline communication -0.008
Student's baseline intrinsic motivation for math 0.008
Baseline teacher care -0.006
Household asset index -0.005
Father education level -0.003
Notes:
The table reports the correlation of each covariate with the proxy predictor S(Z).

Figure 2.1: m̂(·). Figure 2.2: ĝ(·).

Notes:
The figure plots the seven largest lasso coefficients of the interaction terms, obtained by estimating the nuisance functions m(·) and g(·). Colons indicate interactions of variables. The treatment variable D is the first-year effective corporate tax rate. The dependent variable Y is the average entry rate. The lasso coefficients are calculated as the median over splits.