The Value Added of Machine Learning to Causal Inference: Evidence from Revisited Studies
Anna Baiardi† and Andrea A. Naghi‡

December 2020
Abstract:
A new and rapidly growing econometric literature is making advances in the problem of using machine learning methods for causal inference questions. Yet, the empirical economics literature has not started to fully exploit the strengths of these modern methods. We revisit influential empirical studies with causal machine learning methods and identify several advantages of using these techniques. We show that these advantages and their implications are empirically relevant and that the use of these methods can improve the credibility of causal analysis.
Keywords: machine learning, causal inference, average treatment effects, heterogeneous treatment effects.
J.E.L. Classification: C01, C21, D04

∗ Baiardi acknowledges support from EU Horizon 2020, Marie Skłodowska-Curie individual grant (No. 840319). Naghi acknowledges support from EU Horizon 2020, Marie Skłodowska-Curie individual grant (No. 797286). Financial support from the United Nations Sustainable Development Funds is also gratefully acknowledged. We thank participants at the Machine Learning for Economics Workshop (at Barcelona GSE Summer Forum 2019) and at the Netherlands Econometrics Study Group Meeting 2020 for very helpful comments. Nadja van't Hoff and Christian Wirths provided excellent research assistance.

† Department of Economics, Erasmus University and Tinbergen Institute. Email: [email protected].

‡ Department of Econometrics, Erasmus University and Tinbergen Institute. Email: [email protected].

1 Introduction
One of the key goals of empirical research in economics is to estimate the causal effect of a variable of interest on a targeted outcome. To avoid biases in the coefficients of interest due to omitted variables, particularly in observational studies, it is often desirable to include a large number of controls. However, even when the number of raw covariates is relatively small, the inclusion of technical controls (e.g. dummy variables for geographical location, time periods, etc.), interactions and transformations can lead to settings in which the number of covariates is large relative to the sample size.

Machine learning (ML) methods can potentially be useful in such settings. However, standard ML prediction models are aimed at fundamentally different problems than most of the empirical work in economics. ML methods are designed and optimized for predicting the outcome in a test sample. Thus, a model is selected by optimizing the goodness of fit on the held-out test set. In contrast, in empirical economic research, the goodness of fit of a model is oftentimes reduced when estimating a causal effect, and predictive accuracy is sacrificed in order to learn more deeply about a fundamental relationship that can guide policy decisions and counterfactual predictions (Athey and Imbens, 2019). These fundamental differences will eventually generate biased estimates if standard
ML techniques, designed for prediction, are used in the context of causal inference. Nevertheless, a new and rapidly growing econometric literature is making advances in the problem of using ML methods for causal inference questions (see, e.g., Chernozhukov et al., 2018a; Athey et al., 2018; Wager and Athey, 2018; Chernozhukov et al., 2018b). This literature brings in new insights and theoretical results that are novel for both the ML and the econometrics/statistics literature. Despite these advances, the empirical economics literature has not yet started to fully exploit the strengths of these new methods.

Note that by 'prediction' here, we do not mean 'forecasting'. Rather, we refer to a setting where we observe both the outcome and the features/covariates in a training sample and the aim is to predict the outcomes for each observation in an independent test sample, based on the actual values of the covariates in that test sample.

The main underlying reason is that high-dimensional regression adjustments such as lasso, ridge, elastic net, etc., shrink the estimated effects by construction, and ignoring this shrinkage will lead to biased treatment effect estimates.
We revisit influential studies published in leading journals: The Quarterly Journal of Economics, American Economic Journal: Macroeconomics, and American Economic Journal: Applied Economics.
We choose papers for which the full replication data set is available either on the journal's website or on the authors' website. For the ATE, we revisit three observational studies: the study of Djankov et al. (2010) on the effect of corporate taxes on investment and entrepreneurship, the analysis of Alesina et al. (2013) on the long-term effect of plough agriculture on gender norms, and the paper by Nunn and Trefler (2010) on the effect of skill-biased tariffs on long-term economic growth. For the HTE, we select one observational study and one randomized control trial: we extend the observational study by DellaVigna and Kaplan (2007), which investigates the effect of Fox News on the Republican vote share, and the analysis by Loyalka et al. (2019) on the effect of a teacher training randomized intervention on student performance. All these papers include careful econometric analyses of the main research question and mechanisms, which we do not aim to re-examine in full. We instead focus on analyzing the main questions.

Our findings show important differences in the ATE and HTE estimates compared to the traditional methods, both in terms of the size of the treatment effect estimates and in terms of statistical significance. From our results, we derive four main observations about the reasons why causal machine learning methods are relevant for causal analysis and add value relative to the traditional methods. These observations are supported by the theoretical econometrics literature on causal ML (see, for example, Athey et al., 2018; Chernozhukov et al., 2018b; Wager and Athey, 2018).

Firstly, causal ML methods are powerful tools for using data to recover complex interactions among variables and flexibly estimate the relationship between the outcome, the treatment indicator and covariates. This feature is key when drawing inference based on the assumption that the treatment is unconfounded, as is the case in most of the revisited studies, since this assumption is not testable.
As some covariates can be correlated with both the treatment variable and the outcome, failing to condition on all relevant confounders may lead to biased estimates of the treatment effect. For example, for the effect of corporate taxes on investment and entrepreneurship, the original analysis in Djankov et al. (2010) shows a negative and significant effect of corporate taxes on investment and entrepreneurship, but the authors show that these results do not survive when conditioning on all the potential controls at once. However, when implementing DML, we obtain larger estimates compared to Djankov et al. (2010), which are often statistically significant. Similarly, our DML results for the effect of plough cultivation on gender roles suggest a larger effect of the plough compared to the findings in Alesina et al. (2013), when we use the instrumental variable strategy employed in the original analysis. Furthermore, our analysis of the effect of skill-biased tariffs on growth suggests a smaller effect compared to Nunn and Trefler (2010), which is often not statistically significant. We thus argue that the DML estimates are more robust to potential nonlinear confounders.

It is important to note here that the idea of estimating treatment effects without making parametric assumptions about the way in which the covariates enter the equation has already been considered in the semiparametric econometrics literature (see the review paper of Imbens and Wooldridge, 2009, and Imbens and Rubin, 2015). However, in practice, these semiparametric kernel methods quickly break down if they have to deal with more than a few covariates.

Secondly, causal ML methods implement systematic model selection. ML methods search for the best functional forms by estimating and comparing a wide range of alternative model specifications; the model selection is thus data-driven and fully documented. For example, our results for the effect of corporate taxes, originally explored by Djankov et al. (2010), show that the data-driven model selection implemented by DML, which keeps a smaller set of influential confounding factors from among a large set of potential controls, leads to larger coefficients in absolute value and lower standard errors compared to OLS regressions where all the covariates are included.

With the traditional approach to model selection, uncertainty about the correct specification of the model can lead to choices that are relatively ad hoc; different specifications may lead to different point estimates, which in turn may lead to different policy decisions. Moreover, we further illustrate how these methods are also very useful tools for supplementary analyses or robustness checks. Typically, supplementary analysis is performed by presenting a number of selected regression specifications, while the approach of causal ML methods is more systematic, and ensures that important transformations of covariates that are not considered in the selected specifications are not overlooked.

Another observation is that, when researchers search for treatment effect heterogeneity, p-values for single hypothesis testing are not reliable. This is due to the multiple hypothesis testing problem, which can occur when researchers search iteratively for treatment effect heterogeneity over a large number of covariates.

While solutions have been proposed to correct for the issue of multiple hypothesis testing (for example, List et al., 2016), when the number of covariates is large, the power of these approaches to detect heterogeneity is low (Athey and Imbens, 2017).

The econometric theory literature on adapting standard machine learning techniques to causal inference questions is by now fast growing. See, for example, Chernozhukov et al. (2017), Chernozhukov et al. (2018a), Athey et al. (2018), Farrell et al. (2018) and Colangelo and Lee (2020) for the ATE; and Athey and Imbens (2016) and Wager and Athey (2018), among others, for the HTE.

A related issue is the ex-post selection of significant heterogeneous effects.
To avoid this problem, in randomized control trials researchers are often required to specify before the experiment which heterogeneous effects they are interested in looking into, in order to avoid searching for, and only reporting, significant effects. However, this limits the ability of the researcher to find unexpected relevant heterogeneity. Causal ML methods ensure that relevant heterogeneity is not missed while also providing valid confidence intervals. In addition, in observational studies, where pre-analysis plans are not common practice, causal ML methods can be particularly useful.

These methods have also generated a number of early applications. See, for example, Davis and Heller (2017b), Davis and Heller (2017a), Knaus et al. (2020), Strittmatter (2019) and Bertrand et al. (2017) for the causal random forest, and Deryugina et al. (2019) for the generic machine learning.

In what follows, we present our methodology and main findings on the ATE using double machine learning in Section 2. The methodology and analysis of HTE via the causal random forest is summarized in Section 3. Section 4 focuses on the methodology and analysis of HTE using the generic machine learning method. Finally, Section 5 concludes.
This section contains the analysis of the ATE for the effect of corporate taxes on investment and entrepreneurship (Djankov et al., 2010), the effect of plough agriculture on gender roles (Alesina et al., 2013), and the effect of skill-biased tariffs on growth (Nunn and Trefler, 2010), using the double machine learning method (Chernozhukov et al., 2017).
The method is suitable in settings with a large number of covariates relative to the sample size (either because the number of raw covariates is large to begin with, or there is a large number of technical controls), where typical non-parametric kernel or spline methods break down.

The main model specification of the method, in the notation of Chernozhukov et al. (2018a), is the partially linear regression:

Y = Dθ + g(X) + U    (1)
D = m(X) + V         (2)

where Y is the outcome, D is the treatment variable of interest, X is a (high-dimensional) vector of controls, and U and V are disturbances. Equation (1) is the main equation of interest and the parameter θ is the treatment effect we would like to estimate. In this model, θ quantifies the average treatment effect. The second equation is not of direct interest, but it keeps track of the dependence of the treatment on confounders. The covariates are related to the treatment through the function m(X) and to the outcome variable through the function g(X). While m(X) and g(X) can be nonlinear, the treatment variable, D, enters the model linearly (and additively). In observational studies, the function m is typically nonzero, which means that the treatment assignment is not random, but depends on the covariates. The partially linear regression model is also extended to a partially linear IV model to allow for endogenous treatment. We refer to this model as DML-IV.

A first idea one might have for estimating θ with ML methods would be to use a prediction-based ML approach and predict Y using D and X to obtain Dθ̂ + ĝ(X). This can be done, for example, by an iterative method that alternates between estimating g with some ML method and θ with OLS. While this 'naive' ML approach will have very good prediction performance, the iterative ML estimator will be heavily biased, with a slower than 1/√n convergence rate. The primary reason for this poor performance is the bias introduced by regularization.
In order to optimize prediction and avoid overfitting the data with complex functional forms, ML methods use regularization and shrink the less important coefficients towards zero. This reduces overfitting by decreasing the variance of the estimator, but at the same time introduces bias. The bias in estimating g transfers to the parameter of interest θ. The issue is similar to the omitted variable bias.

To overcome regularization bias, Chernozhukov et al. (2017) propose 'double machine learning', i.e., solving two prediction problems instead of one. First, a ML model is fitted for m in the treatment equation, and the effect of X is partialled out from D to get the residuals V̂ = D − m̂(X). Second, a ML method is fitted for g in the outcome equation and the residuals Ŵ = Y − ĝ(X) are obtained. Finally, the residuals Ŵ are regressed on the residuals V̂ to obtain the 'debiased' machine learning estimator, θ̌. It can be shown that by orthogonalizing D with respect to X and eliminating the effect of confounders by subtracting an estimate of g, θ̌ removes the effect of regularization bias.

However, θ̌ is still subject to bias due to overfitting. For instance, when ĝ is overfit, it will mistake noise for signal and thus pick up some of the noise U from the outcome equation. If U and V are correlated, the estimation error in ĝ will be correlated with V. To break this correlation and avoid bias due to overfitting, one can rely on sample splitting. To this end, the data is partitioned into two subsamples: a main sample and an auxiliary sample. The ML models for the two nuisance functions m and g are fit on the auxiliary sample, while the residual-on-residual regression to obtain θ̌ is fit on the main sample.

A drawback of sample splitting is that the estimator of the parameter of interest θ is obtained using only the main sample, which can lead to loss of efficiency.
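The split-sample, residual-on-residual procedure described above can be sketched in a few lines. This is a minimal illustration on simulated (hypothetical) data, using random forests from scikit-learn as the nuisance learners; it is not the implementation used in the paper.

```python
# Minimal sketch of split-sample DML on simulated data (all values hypothetical).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, theta = 1000, 20, 0.5          # sample size, covariates, true effect

# Simulate from the partially linear model: D = m(X) + V, Y = D*theta + g(X) + U
X = rng.normal(size=(n, p))
m = np.sin(X[:, 0]) + X[:, 1] ** 2   # nonlinear treatment equation
g = np.cos(X[:, 0]) + X[:, 2]        # nonlinear outcome equation
D = m + rng.normal(size=n)
Y = D * theta + g + rng.normal(size=n)

# Split: auxiliary sample for nuisance estimation, main sample for the
# residual-on-residual regression
aux, main = np.arange(n // 2), np.arange(n // 2, n)
m_hat = RandomForestRegressor(random_state=0).fit(X[aux], D[aux])
g_hat = RandomForestRegressor(random_state=0).fit(X[aux], Y[aux])

# Residual-on-residual regression on the main sample gives the debiased estimate
V_res = D[main] - m_hat.predict(X[main])
W_res = Y[main] - g_hat.predict(X[main])
theta_check = (V_res @ W_res) / (V_res @ V_res)
print(round(theta_check, 2))
```

Fitting the nuisance functions on the auxiliary half and running the final regression on the other half is exactly the device that breaks the correlation between the estimation error in ĝ and V discussed above.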
However, one can switch the roles of the main and auxiliary samples (a procedure called cross-fitting) and average the results, which will lead to a more efficient estimator. In addition, one can perform a K-fold version of the cross-fitting procedure, where the size of each fold is n/K. Each sample partition or fold is successively taken as the main sample, while the complement of each fold serves as the auxiliary sample. One can then take the average of the estimates over the K folds. To make the results robust to data partitioning, the splitting into folds is performed S times, and the final DML estimator is the mean (or median) over the splits. The median version is more robust to outliers and is the one we use in the applications.

The nuisance functions m and g can be estimated with a variety of ML methods, such as lasso, regression trees, random forests, boosting, neural networks, or hybrid methods.

This is because the scaled estimation error, √n(θ̌ − θ), now contains a term based on the product of two estimation errors (the estimation errors in m̂ and in ĝ), which vanishes faster than the equivalent term obtained from using the naive estimator, which depends only on the estimation error of ĝ.

2.2 Applications with Double Machine Learning

The first paper that we revisit using causal machine learning methods investigates the relationship between corporate taxes and investment and entrepreneurship (Djankov et al., 2010). This is an observational study that shows a negative effect of corporate taxes on investment and entrepreneurship, by estimating OLS country-level regressions with different measures of corporate tax rates for the year 2004. The sample includes a set of 50-85 countries, depending on the specification. In the original paper, four outcome variables are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate.
Three measures of corporate taxes are considered: statutory corporate tax rates, the actual first-year corporate income tax liability of a new company, and the tax rate which takes into account actual depreciation schedules going five years forward.

The original paper reports the results for several regression specifications with different sets of control variables, to account for potential confounders that correlate with corporate tax rates and are also determinants of the outcomes. Djankov et al. (2010) present regression results where the first three sets of covariates are added separately. A final robustness check includes all control variables (12 in total) in the same regression. In the specifications which include only one set of controls at a time, the paper shows a negative and statistically significant effect of corporate taxes on entrepreneurship and investment. However, when adding all the controls, the relationship is still negative, but the coefficients are smaller in size and no longer statistically significant.
DML Analysis.
We revisit the final robustness check of the paper, which includes all four sets of covariates at the same time, using the DML partially linear model. Table 1 presents the results. Columns (1) to (7)

The first set of controls includes measures of other taxes; the second set includes measures for the number of other tax payments made and for tax evasion; the third set includes measures for institutions; the fourth set includes measures of inflation. Section A.1 of the Appendix includes more details on the regressions estimated in Djankov et al. (2010) and describes the control variables.

Table 1:
                                (1)     (2)       (3)      (4)     (5)         (6)      (7)    (8)
                               Lasso  Reg. Tree  Boosting  Forest  Neural Net. Ensemble Best   OLS
Panel A: Investment 2003-2005
Statutory corporate tax rate  -0.074  -0.069  -0.068  -0.07   -0.056  -0.066  -0.071  -0.064
                              (0.09)  (0.072) (0.076) (0.087) (0.102) (0.087) (0.088) (0.098)
First-year effective tax rate -0.114  -0.129  -0.154  -0.144  -0.122  -0.13   -0.133  -0.117
                              (0.094) (0.087) (0.093) (0.096) (0.097) (0.092) (0.095) (0.106)
Five-year effective tax rate  -0.187  -0.182  -0.211  -0.21   -0.217  -0.216  -0.207  -0.189
                              (0.089) (0.089) (0.092) (0.097) (0.103) (0.095) (0.101) (0.118)
Observations                   61      61      61      61      61      61      61      61
Panel B: FDI 2003-2005
Statutory corporate tax rate  -0.148  -0.157  -0.153  -0.14   -0.085  -0.133  -0.114  -0.030
                              (0.083) (0.086) (0.092) (0.094) (0.093) (0.088) (0.09)  (0.066)
First-year effective tax rate -0.141  -0.194  -0.178  -0.157  -0.136  -0.161  -0.137  -0.1
                              (0.091) (0.081) (0.081) (0.074) (0.078) (0.08)  (0.079) (0.071)
Five-year effective tax rate  -0.147  -0.177  -0.167  -0.165  -0.139  -0.157  -0.14   -0.095
                              (0.084) (0.073) (0.074) (0.077) (0.082) (0.077) (0.076) (0.081)
Observations                   61      61      61      61      61      61      61      61
Panel C: Business density
Statutory corporate tax rate  -0.062  -0.092  -0.069  -0.07   -0.056  -0.066  -0.06   -0.034
                              (0.066) (0.072) (0.061) (0.063) (0.077) (0.069) (0.064) (0.083)
First-year effective tax rate -0.104  -0.156  -0.124  -0.122  -0.105  -0.114  -0.1    -0.068
                              (0.076) (0.082) (0.07)  (0.069) (0.085) (0.072) (0.07)  (0.092)
Five-year effective tax rate  -0.091  -0.139  -0.122  -0.107  -0.115  -0.114  -0.104  -0.070
                              (0.076) (0.08)  (0.071) (0.067) (0.087) (0.074) (0.075) (0.103)
Observations                   60      60      60      60      60      60      60      60
Panel D: Average entry rate 2000-2004
Statutory corporate tax rate  -0.112  -0.147  -0.141  -0.127  -0.067  -0.112  -0.106  -0.029
                              (0.073) (0.068) (0.064) (0.065) (0.084) (0.067) (0.069) (0.086)
First-year effective tax rate -0.130  -0.144  -0.143  -0.125  -0.131  -0.126  -0.117  -0.083
                              (0.072) (0.064) (0.065) (0.066) (0.086) (0.07)  (0.072) (0.094)
Five-year effective tax rate  -0.154  -0.153  -0.164  -0.164  -0.191  -0.168  -0.167  -0.133
                              (0.084) (0.069) (0.07)  (0.07)  (0.091) (0.08)  (0.077) (0.103)
Observations                   50      50      50      50      50      50      50      50
Raw covariates                 12      12      12      12      12      12      12      12
Notes:
Analysis of Table 5D of Djankov et al. (2010) using DML. Column 8 reports the original paper estimates. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. The number of covariates does not include the treatment variable.

display the DML point estimates for the effect of corporate taxes on investment and entrepreneurship, using different ML methods to estimate the nuisance functions. Further details on how the DML estimates are obtained, the methods used and the tuning parameters are described in Section A.1 of the Appendix.

We notice that all the DML point estimates have negative signs and generally similar magnitudes across the ML methods. Compared to the original paper results with the full set of covariates, reported in column (8), the magnitude of the DML coefficients is higher in absolute value, and the standard errors are lower in most regressions. Additionally, the results are statistically significant, at least at the 10% level, in half (42 out of 84) of the regressions. The main difference between our approach and that of Djankov et al. (2010) is that the original paper results are based on the assumption of linearity and additivity of the conditional expectations, while the DML method allows for a more flexible specification. Thus, our findings are more robust to potential nonlinear confounders compared to the original paper estimates. A researcher might be interested in investigating which nonlinear terms make the estimates different. However, this can be a challenging task when ML methods (such as neural networks, hybrid methods, etc.) are used to estimate the nuisance functions. What can potentially be done is analyzing the lasso coefficients that are not shrunk to zero and looking for nonlinearities among these.
As an example, we show in Figure B.1 the most relevant among the nonlinear terms selected by the lasso, for one of the DML regressions reported in Table 1. Here, we note that several nonlinear terms appear both in the treatment nuisance function m̂(·) and in the outcome nuisance function ĝ(·). This is suggestive of the fact that there are nonlinearities that are correlated with both the treatment variable and the outcome. These were missed by the analysis in the original paper, and their omission could lead to biased coefficients on the corporate taxes variables. In this case, controlling for all relevant confounders strengthens the main results of the original analysis: in many cases the DML treatment effect estimates are larger in absolute value, and statistically significant.

This empirical example is also useful to illustrate a typical trade-off that the applied researcher might face. On the one hand, the researcher wants to control for as many potential confounders as possible, in order to improve the credibility of the unconfoundedness assumption. On the other hand, naively controlling for a large set of covariates, especially when the sample size is small, can lead to imprecise estimates and larger standard errors. Notice that in this example, the authors implement a "kitchen sink" regression that includes all the potential controls at once.

Further details about the lasso coefficients analysis are reported in Section A.1 of the Appendix.
The study by Alesina et al. (2013) examines the relationship between historical plough agriculture and gender roles. The mechanism is the following: since the plough requires physical strength to be operated, in areas where plough agriculture was widespread, men had an advantage in agriculture compared to women. This would result in societies in which men worked in farming, whereas women's work would be performed mainly within the home. The division of labour by gender would translate into norms and cultural beliefs about the role of women in society, which would still persist nowadays, even after societies have moved out of agriculture as the main economic activity.

In the paper, the authors present results using country-level and individual-level regressions. We revisit the main question addressed in the original paper, focusing on the country-level results, as the majority of the regressions reported in the original paper are based on this data. For the country-level baseline regressions, estimated with OLS, three contemporary outcome variables are examined as measures of gender roles: female labour force participation, the share of firms with a woman among its principal owners, and the proportion of seats held by women in the national parliament. The treatment variable measures the share of individuals in each country whose ancestors practiced plough agriculture. The baseline regressions control for income, income squared, and for measures of the historical characteristics of the ethnicities living in a country. Continent fixed effects are added in some specifications.

As mentioned by Alesina et al. (2013), concerns about potential endogeneity are addressed in two ways.

More details on the regressions and the control variables from the study of Alesina et al. (2013) are described in Section A.2 of the Appendix.
Second, the authors use an instrumental variable approach, which exploits the fact that plough adoption is correlated with the suitability of the land for cereal crops that would benefit, and crops that would not benefit, from the plough. To this end, two instruments for plough adoption are constructed, based on the analysis by Pryor (1985). The first is the suitability for "plough-positive" cereal crops (i.e. those which benefit most from the plough), and the second is the suitability for "plough-negative" cereal crops (i.e. those which benefit least from the plough).

DML Analysis.
In our analysis, we re-examine both the country-level OLS and IV regressions, applying the DML method. For the OLS analysis, we begin by estimating a DML partially linear model that only includes the baseline set of controls as raw covariates. We then revisit the robustness analysis of this specification, by including as raw covariates the largest set of controls used in the robustness checks (this corresponds to Table 7, column 8 of the original paper), to which we also add the continent fixed effects. This amounts to a total of 36 raw covariates. For the IV analysis, as noted

The additional controls are listed in Section A.2 of the Appendix.

See Alesina et al. (2013) for details on the data used and how the instruments are constructed. The paper also shows that the two instruments, indeed, predict plough adoption.

When revisiting the robustness analysis with DML, we include continent fixed effects, even though the original paper did not include them in their most complete robustness checks. As causal ML methods can handle a large number of covariates, we include all the covariates which were considered in the original paper, to ensure that all potential confounders are taken into account.
in Alesina et al. (2013), the main concern with the instrumental variable strategy is the possibility that areas suitable for different crops could be correlated with geographic characteristics that have an effect on gender norms through other channels, besides plough adoption (i.e., the exclusion restriction might not hold). Therefore, for the IV analysis, in addition to the baseline controls and in line with the original paper, we consider the geo-climatic characteristics the authors use in their IV robustness checks (Table A14 of the Online Appendix of the original paper). To these variables, we again add the continent fixed effects. Further details on how the DML estimates are obtained and the tuning choices are described in Section A.2 of the Appendix.

Table B.2 reports the results of the DML partially linear model that replicates the baseline regression. In accordance with the original paper, the treatment effect estimates are negative and statistically significant. They are also close to the original estimates (reproduced here for convenience in column 8 of Table B.2), and, reassuringly, fairly stable across the ML methods. We find, however, very different results when carrying out the robustness analysis of this baseline specification with the DML method. Panel A of Table 2 reports the results. While the effect is still negative, albeit much smaller in absolute value, statistical significance is now lost. Interestingly, when Alesina et al. (2013) include all covariates at once (the estimate is reproduced in the last column of our Table 2), the treatment effect becomes smaller in absolute value, compared to when groups of covariates are added separately (see their Table 7, columns 1 to 7), or compared to the baseline specification (reproduced in column 8 of our Table B.2).
With DML, the treatment effect of interest not only becomes smaller, but also statistically insignificant.

Our findings up to this point would lead us to (mistakenly) conclude that the negative effect of plough adoption on attitudes towards gender roles may not be as large as suggested by the original analysis, and that the effect is not statistically significant. However, our estimates from the DML partially linear model may still be subject to endogeneity. While flexibly controlling for a large number of covariates can account for the confounding effect of observed characteristics, the remaining concern is that plough adoption may be correlated with unobserved characteristics that also affect gender norms.

Table 2: On the origins of gender roles: Country-level estimates with full set of controls. Columns (1)-(7): Lasso, Reg. Tree, Boosting, Forest, Neural Net., Ensemble, Best; column (8): OLS (Panel A) or 2SLS (Panel B). Panel A: DML, partially linear model; outcome: female labour force participation; treatment: traditional plough use. Panel B: DML-IV; outcome: female labour force participation; treatment: traditional plough use. Notes: Analysis of the main robustness checks of Alesina et al. (2013) using DML. Column 8 reports the results of the most complete robustness checks for the OLS and IV specifications in the original paper. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Robust standard errors are reported in column 8. The number of covariates does not include the treatment variable.

Overall, when looking at both the robustness analysis and the IV analysis and comparing them to the baseline results, we notice that our estimates move in the same direction as the original paper estimates, but our estimates move even more, supporting the idea that DML controls more flexibly for relevant covariates.

This empirical example is a good illustration of the gains from combining modern ML tools with quasi-experimental methods such as instrumental variables. While causal ML methods can make the unconfoundedness assumption more plausible by flexibly controlling for observed confounders, they cannot account for unobserved confounders. In such settings, the researcher could combine causal ML methods with quasi-experimental strategies.

As explained above, our DML specification differs from the original paper's robustness analysis because it considers nonlinearities and it includes continent fixed effects. Therefore, the differences between the DML and the original estimates could, in principle, be driven by the continent fixed effects, and not by the nonlinearities. The original paper shows that adding the continent fixed effects to the baseline specification leads to very small changes in the OLS estimates (see Table 4 in the original paper), while it results in larger changes in the IV case (see Table 8 in the original paper).
However, even in the IV case, including the continent fixed effects only increases the absolute size of the plough coefficient by 3-4 percentage points, while the DML coefficients exceed the OLS and 2SLS estimates by more than double that amount (with the exception of the neural network and ensemble estimates). Thus, we conclude that allowing for a more flexible nuisance function is likely to be driving at least part of the differences between the DML and the 2SLS (and OLS) estimates.
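The cross-fitted partially linear DML estimator discussed above can be illustrated with a toy simulation. This is a minimal sketch with made-up data and our own variable names, not the paper's replication code: the outcome and treatment are each residualized on the covariates using random forests trained on the opposite fold, and the treatment effect is the OLS coefficient of the outcome residuals on the treatment residuals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)                   # treatment, confounded by X[:, 0]
y = -0.5 * d + X[:, 0] ** 2 + rng.normal(size=n)   # true treatment effect: -0.5

# Cross-fitting: each fold's nuisances are predicted by models trained on the other fold
res_y, res_d = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    g = RandomForestRegressor(random_state=0).fit(X[train], y[train])   # E[Y | X]
    m = RandomForestRegressor(random_state=0).fit(X[train], d[train])   # E[D | X]
    res_y[test] = y[test] - g.predict(X[test])
    res_d[test] = d[test] - m.predict(X[test])

# Final stage: OLS of outcome residuals on treatment residuals
theta = res_y @ res_d / (res_d @ res_d)
```

Because the confounding through X[:, 0] enters the outcome nonlinearly, a linear regression on the raw covariates would be misspecified, while the forest-based residualization recovers an estimate close to the true effect of -0.5.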
The study by Nunn and Trefler (2010) investigates the relationship between skill-biased tariffs, i.e., a tariff structure that disproportionately favours skill-intensive industries, and long-term economic growth. The authors develop a theoretical framework based on Grossman and Helpman (1991) that shows how tariffs that focus on skill-intensive industries can lead to a disproportionate expansion of skill-intensive industries, which in turn leads to higher long-term growth. Furthermore, using both cross-country and industry-level data, the paper provides evidence of a positive relationship between the two variables and delves into the mechanisms of this relationship. The findings suggest that the mechanisms from the theoretical framework can explain only part of the total correlation between skill-biased tariffs and growth. The paper attributes the remaining part of the correlation to the endogeneity of skill-biased tariffs, and in particular to the relationship between institutions and the skill bias of tariffs: countries with good institutions tend to protect skill-intensive industries more.

In Nunn and Trefler (2010), three measures of the skill bias of tariffs in the initial time period are used: the correlation between industry tariffs and industry skill intensity, and two measures based on the difference between the log average tariffs in skill-intensive industries and the log average tariffs in unskilled-intensive industries, which use different cut-off values for industry skill intensity. In the country-level estimates, the outcome is log annual per capita GDP growth, and the regressions include a set of control variables. The country-level regressions include 63 observations.

For the industry-level estimates, the outcome variable is the average annual log change in industry output in each country, and the regressions include all the controls that appear in the country-level regressions, plus industry fixed effects.
These regressions include 1,004 data points for 59 countries. An additional variable (the initial industry tariff) is included in some specifications to capture a potential mechanism: skill-biased tariffs can shift resources towards skill-intensive industries that generate positive externalities, thus leading to higher long-term growth. Accordingly, industries with higher initial tariffs should have higher long-run output. If this channel can explain the effect of skill bias on growth, the coefficient on the skill bias of tariffs should decrease in size when this variable is included in the regression.
DML Analysis.
We revisit the country- and industry-level regressions reported in Table 4 (columns 1, 2 and 4), Table 5 (columns 1, 2 and 4) and Table 6 (columns 1, 3 and 7) of Nunn and Trefler (2010).

The initial time period is 1972 for 21 countries, 1980-83 for 30 countries and 1985-87 for 12 countries. The end period is 2000 for most countries, except for 3 of them, for which data ends in 1996. See Nunn and Trefler (2010), Table 1 for a list of the countries included and the respective time periods.

Further details on the regressions estimated by Nunn (2007) and on the control variables are described in Section A.3 of the Appendix.

[Table: The Structure of Tariffs and Long-Term Growth, country-level estimates. Columns: Lasso, Reg. Tree, Boosting, Forest, Neural Net., Ensemble, Best, OLS. Panel A: skill tariff correlation; Panel B: tariff differential (low cut-off); Panel C: tariff differential (high cut-off). Notes: Analysis of the country-level estimates of Nunn and Trefler (2010) using DML. The last column reports the original paper estimates. Standard errors are reported in parentheses; standard errors adjusted for variability across splits using the median method are reported for the DML estimates.]
The number of covariates does not include the treatment variable.

Heterogeneous Treatment Effects with Causal Random Forest
This section focuses on the analysis of HTE for the effect of Fox News on Republican voting (DellaVigna and Kaplan, 2007) using the causal random forest method (Wager and Athey, 2018).
The causal random forest method is an adaptation of the original random forest for prediction, introduced by Breiman (2001), to the problem of causal inference. In this section, we start by briefly presenting the general idea of standard regression trees used for prediction, after which we describe how causal trees and causal random forests work.

The idea of regression trees is to partition (or split) the data into groups based on the values of the covariates. The groups that are eventually obtained are referred to as leaves. First, one starts with the whole data set as one group. Then, for each value of each covariate, the regression tree algorithm forms candidate splits, by placing all observations with a covariate value lower than the current value in the left leaf, and all observations with a covariate value greater than the current value in the right leaf. Among all these candidate splits, the one that is implemented is the one that minimizes an in-sample criterion function, such as the mean squared error (MSE) of the outcome variable within a leaf. For each of the two new leaves, the algorithm repeats the procedure until a stopping rule is reached, resulting in a tree-format partition of the data. Using the terminal leaves, when the purpose is prediction, the outcome of an out-of-sample observation can be predicted by determining which terminal leaf the new observation belongs to, based on the values of its covariates, and assigning as its predicted outcome the mean of the outcomes in that leaf.

This mean squared error is computed as the sum of the squared differences between the outcomes of each unit within a leaf and the mean of these units in the leaf.

The stopping criterion can be, for example: a pre-specified maximum number of leaves; the iteration at which the minimizing split uses a covariate over which the observations have already been split; or the iteration at which the proposed split does not decrease the mean squared error any further.
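As an illustration with toy data (not from the paper), the prediction logic just described, greedy MSE-minimizing splits followed by leaf-mean prediction, can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 2))
# Step function of the first covariate plus noise: the ideal first split is at 0.5
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=400)

# Greedy splits minimizing within-leaf MSE; max_depth and min_samples_leaf act as stopping rules
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20).fit(X, y)

# A new observation is dropped down the tree and assigned its terminal leaf's mean outcome
pred = tree.predict(np.array([[0.2, 0.7]]))[0]
```

Since the new observation has a first covariate below 0.5, its leaf mean sits near 1, recovering the underlying step function.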
We now turn to the causal random forest method of Wager and Athey (2018), which builds on the causal tree method of Athey and Imbens (2016). For the causal tree, first, a fraction p of the sample of N observations is drawn without replacement. Then, the subsample of n = p * N observations is randomly split in half to form a training sample n_tr and an estimation sample n_e. Using only the training sample n_tr, candidate splits are formed for each value of each covariate, and a regression tree as described above is constructed. The key difference in the causal case compared to the prediction case is the objective function that is optimized when determining the split to be implemented.

Due to the fundamental problem of causal inference, directly training machine learning methods on the difference Y_i(1) - Y_i(0), i.e., the difference between the outcomes that observation i would have experienced with and without the treatment, is not possible, as we do not observe both outcomes for any individual unit. Thus, instead of minimizing an infeasible MSE, Athey and Imbens (2016) propose to maximize a criterion function that rewards a split that increases the variance of treatment effects across leaves and penalizes a split that increases within-leaf variance. The goal is to accurately estimate treatment effects within leaves, while preserving heterogeneity across leaves. The split is performed if it increases the criterion function, compared to no split. When no more splits can be made, the tree constructed on the first subsample is complete.

The subsequent step involves turning to the estimation sample n_e and, based on the covariates, sorting each observation in this sample into the same tree. Using only the estimation sample, the treatment effect in each leaf is computed as tau_hat_l = y_bar_lt - y_bar_lc, i.e., the mean outcome difference between treated (t) and control (c) observations within leaf l.
The final step consists of returning to the full sample of N observations, determining to which leaf each observation belongs based on the values of its covariates, and assigning that leaf's treatment effect as the predicted treatment effect of the observation. Given that estimates from a single tree can have high variance, the whole algorithm described above is repeated for B subsamples, on which B trees are obtained that together form a causal random forest. The predicted treatment effect for each unit is the average of the predictions for that particular observation across the trees.

Notice that independent samples are used for: i) growing the tree (splitting the data), and ii) estimating treatment effects within each leaf of the tree. This property is called honesty. Honesty leads to two desirable characteristics: it reduces bias from overfitting, and it makes inference valid, since the asymptotic properties of the treatment effect estimates are the same as if the structure of the tree had been exogenously given. Wager and Athey (2018) establish consistency and the first asymptotic normality results for random forests, which are then extended to the causal setting. For valid confidence intervals, a consistent estimator of the asymptotic variance is proposed, based on an infinitesimal jackknife for random forests. Further details regarding the tuning parameters of the causal random forest are provided in Section A.4 of the Appendix.
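A minimal honest causal tree in the spirit of the procedure above can be sketched on simulated data. This is a simplified stand-in, not the Wager-Athey implementation: we assume a randomized treatment with known propensity 0.5 and grow the tree on an IPW-transformed outcome in place of the Athey-Imbens splitting criterion, then estimate leaf effects on a held-out half (the honesty step).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(size=(n, 2))
w = rng.binomial(1, 0.5, size=n)               # randomized treatment, p(Z) = 0.5
tau = np.where(X[:, 0] < 0.5, 0.0, 2.0)        # true heterogeneous effect
y = tau * w + rng.normal(size=n)

# Honesty: one half grows the tree, the other half estimates leaf effects
idx = rng.permutation(n)
tr, est = idx[: n // 2], idx[n // 2 :]

# With p(Z) = 0.5 the transformed outcome 2*(2w - 1)*y has conditional mean tau(Z),
# so an off-the-shelf regression tree on it stands in for a causal split criterion
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50)
tree.fit(X[tr], 2 * (2 * w[tr] - 1) * y[tr])

# Estimation sample only: treated-minus-control mean outcome within each leaf
leaf_est = tree.apply(X[est])
leaf_all = tree.apply(X)
tau_hat = np.zeros(n)
for leaf in np.unique(leaf_all):
    m = leaf_est == leaf
    t, c = m & (w[est] == 1), m & (w[est] == 0)
    if t.any() and c.any():
        tau_hat[leaf_all == leaf] = y[est][t].mean() - y[est][c].mean()
```

Averaging such honest trees over many subsamples yields the causal random forest; here a single tree already separates the low-effect region (first covariate below 0.5) from the high-effect region.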
Description of Original Analysis.
In this section we revisit and further analyze the study by DellaVigna and Kaplan (2007). This paper examines the impact of media bias on voting outcomes. Specifically, it analyzes the impact of the entry of a conservative cable television channel, Fox News, on the Republican Party's vote share in the United States. To identify the causal effect of Fox News on voting, the authors investigate whether towns where Fox News became available between 1996 and 2000 experienced an increase in the vote share for the Republican Party in Presidential elections during the same time period. The estimation is performed on a town-level data set comprising information on 9,256 towns.

We consider the main outcome variable, i.e. the change in the vote share for the Republican party between 1996 and 2000. The treatment variable is a dummy indicating whether Fox News had become available in the town.

Sample splitting, in general, can be inefficient, as part of the data is not used. However, this loss of precision does not occur in the case of causal random forests: although no observation is allowed to be used within the same tree for both partitioning the covariate space and estimation, when the data is subsampled and the forest is obtained from many trees, each individual unit will appear in both the training sample and the estimation sample of some tree.

DellaVigna and Kaplan (2007) find a positive effect of Fox News on the Republican vote share. Moreover, they explore heterogeneity along a selected set of town characteristics: the number of available cable channels, the share of urban population, and whether the town is in a swing or Republican district. They do this by adding to the regression interaction effects of these covariates with the treatment variable.

Causal Forest Analysis.
We perform the HTE analysis using the causal random forest method. Exploring heterogeneous effects is important for this study in order to understand whether there are town or district characteristics that act as effect modifiers. While the average effects are informative about the impact of Fox News on the whole sample, it is often the case that treatment effects are not homogeneous. It is possible that the effect of Fox News was concentrated in some areas only. Understanding better the characteristics of the areas which saw the strongest and weakest responses can shed light on the mechanisms. The aim of this exercise is two-fold. First, we take an agnostic view about the nature of heterogeneity, and we investigate whether there are town or district characteristics which are treatment effect modifiers. Second, we examine whether the HTE analysis from the original paper matches the results from the causal ML methods.

We focus on one of the two preferred specifications from the original paper: the one that includes district fixed effects. We present results for two versions of the causal random forest, which account for district-level effects in different ways. In the first set of results, we include in the analysis dummy variables indicating the congressional district where the town is located. In the second set of results, we implement a cluster-robust version of the random forest developed by Athey and Wager (2019), where we treat each district as a separate cluster. The advantage of the cluster-robust causal forest is that it does not assume that clusters have an additive effect on the outcome. Further details on the cluster-robust causal forest and the tuning parameter values used for the analysis are discussed in Section A.4 of the Appendix.

Further details on the regressions and on the control variables in DellaVigna and Kaplan (2007) are described in Section A.4 of the Appendix.

The findings are reported in Table 6 of the original paper.

Table 4
                                        (1)                   (2)
                                 District dummies       Cluster-robust
Fox News effect (ATE)                 0.0065                0.0065
                                     (0.0016)              (0.0027)
Fox News effect above median          0.013                 0.0072
                                     (0.0024)              (0.0028)
Fox News effect below median         -0.0033                0.0044
                                     (0.0021)              (0.0048)
95% CI for the difference      (0.01009, 0.02255)    (-0.00806, 0.01374)
Observations                          9256                  9256

Notes: This table reports the estimated average treatment effect and a test for overall heterogeneity using the causal forest. Standard errors are reported in parentheses. ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively.

We begin by discussing the average treatment effect. The results are presented in Table 4. As in the original analysis, we find a positive and significant effect of Fox News on the Republican vote share, both when including district dummies and when implementing the cluster-robust causal forest; however, the standard error in the clustered forest is larger. Our results suggest that in towns where Fox News became available the Republican party obtained a vote share that was 0.65 percentage points higher on average, compared to towns where Fox News was not available. The ATE estimates are similar to the original paper estimates, which range between 0.004 and 0.007 (reported in Table 4 of DellaVigna and Kaplan, 2007, columns 4-7).

Next, we want to assess whether the causal forest can recover heterogeneity in treatment effects. As pointed out in Athey and Wager (2019), we can group observations according to whether their estimated out-of-bag conditional average treatment effect (CATE) is above or below the median CATE, and we can estimate the average treatment effect separately for these two subgroups. These are reported in Table 4 as
Fox News effect above median and Fox News effect below median. The difference between the two subgroup estimates is large when including district dummies, suggesting that there is potential for heterogeneity, and it is statistically significant, as indicated by the fact that the 95% confidence interval for the difference between the two estimates does not contain zero (see column 1 of Table 4). However, the same heuristic test for the cluster-robust forest does not detect significant heterogeneity in the treatment effect. This could indicate that heterogeneity in the model with district dummy variables is overstated, because the dummy variables cannot appropriately capture the district-specific effects. The cluster-robust causal forest offers a more flexible way to capture district-specific effects, and may be more suitable in this case.

Although the results of the test for overall heterogeneity are mixed, it is still possible for heterogeneity to be present along some of the covariates. Hence, we investigate whether any of the included covariates are possible sources of heterogeneity. To do this, for each variable, we split the sample in two parts, based on whether the value of the covariate of interest is below or above the median, and we estimate the average treatment effect for the two subsamples. Table 5 reports the HTE results along the variables that appear to be significant determinants of heterogeneity in both specifications, while Tables B.5 and B.6 report the results for the remaining variables. In addition, to gain further insight into which variables are more important for heterogeneity, we compute a measure of variable importance (Athey and Wager, 2019). Tables B.7 and B.8 report the variable importance measure for the covariates included in the district dummy variable specification and for the cluster-robust forest, respectively.
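The above/below-median heuristic can be sketched on simulated data. Here tau_hat is a noisy stand-in for the out-of-bag CATE predictions of a causal forest, and we assume a randomized treatment with known propensity 0.5 so that simple IPW scores identify subgroup average effects:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(size=n)
w = rng.binomial(1, 0.5, size=n)             # randomized treatment, p(Z) = 0.5
y = 2 * x * w + rng.normal(size=n)           # true CATE: tau(x) = 2x

tau_hat = 2 * x + 0.3 * rng.normal(size=n)   # stand-in for out-of-bag forest CATEs

# IPW scores: with p = 0.5, psi = 2*(2w - 1)*y has mean tau within any subgroup
psi = 2 * (2 * w - 1) * y
hi = tau_hat >= np.median(tau_hat)

ate_hi, ate_lo = psi[hi].mean(), psi[~hi].mean()
se = lambda a: a.std(ddof=1) / np.sqrt(a.size)
diff = ate_hi - ate_lo
diff_se = np.sqrt(se(psi[hi]) ** 2 + se(psi[~hi]) ** 2)
ci = (diff - 1.96 * diff_se, diff + 1.96 * diff_se)   # 95% CI for the difference
```

If the confidence interval for the difference excludes zero, as it does here by construction of the heterogeneous effect, the heuristic signals the presence of treatment effect heterogeneity.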
We note that for both specifications the variable importance measure decreases smoothly, and we do not observe any variable that clearly stands out in terms of importance. Our results in Table 5 show that three variables appear to be significant determinants of heterogeneity (at least at the 10% level) in both specifications: the change in employment between 1990 and 2000, the share of the population with education level equal to a high school degree, and the 10th decile in the number of cable channels available. We observe that the effect of Fox News on Republican voting is stronger in towns that experienced a smaller increase in the employment rate between 1990 and 2000. This finding may relate to the phenomenon of economic voting, i.e. the fact that voters tend to reward incumbents during periods of economic prosperity (e.g. Fair, 1978; Kramer, 1971; Lewis-Beck and Stegmaier, 2000; Pissarides, 1980).

Athey and Wager (2019) find a similar result in their application, when comparing the causal forest without clustering with the cluster-robust version.

See Section A.4 of the Appendix for details on how this measure is constructed.

Table 5
                                               (1)                 (2)                 (3)
                                         CATE below median   CATE above median   p-value difference
Panel A: District dummies
Employment rate, diff. btw. 2000 and 1990     0.00928             0.00064             0.00656
                                             (0.00244)           (0.00203)
Share high school degree 2000                 0.00805            -0.00007             0.00884
                                             (0.00226)           (0.00213)
Decile 10 in no. cable channels available     0.00877            -0.0044              0.00006
                                             (0.00192)           (0.00264)
Panel B: Cluster-robust
Employment rate, diff. btw. 2000 and 1990     0.00938             0.0002              0.06885
                                             (0.00254)           (0.00436)
Share high school degree 2000                 0.00859            -0.00179             0.05296
                                             (0.00303)           (0.00442)
Decile 10 in no. cable channels available     0.00857            -0.00495             0.02033
                                             (0.00289)           (0.00506)

Notes: This table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses.

Areas that experienced lower economic growth (and a smaller increase in employment) may have been more easily persuaded to vote Republican in 2000, since prior to the Presidential election of 2000 a Democratic President (Bill Clinton) had been in power for two consecutive mandates. Moreover, we observe a larger effect of Fox News in towns where the share of the population with education level equal to a high school degree is below the median. We also find a larger positive effect of Fox News in towns where the 10th decile in the number of cable channels is below the median, while the effect is negative and insignificant in towns where this variable is above the median.

The median value for the 10th decile in the number of cable channels is zero; hence, towns with a value of this variable above the median correspond to towns that are in the top decile in terms of the number of cable channels available.

Next, we investigate whether the findings regarding heterogeneity from the original paper are confirmed with the causal forest. DellaVigna and Kaplan (2007) found a larger effect of Fox News on the Republican vote share in towns with a smaller number of cable channels available when including district fixed effects. While we do not observe significant heterogeneity along this variable, our results for the 10th decile in the number
of cable channels are in line with the findings of the original analysis, and hence suggest that the effect of Fox News diminishes in the presence of stronger competition among cable channels. It is also interesting to note that the number of cable channels emerges as the variable with the highest importance score in both specifications, which further points to the importance of this variable for heterogeneity. When investigating heterogeneity along the political orientation of the district, we confirm the findings of DellaVigna and Kaplan (2007): we observe no significantly different effect for swing districts, and we obtain mixed results for Republican districts, as we find a significantly smaller effect of Fox News in Republican districts (at the 10% level) when including district dummies, but not with the cluster-robust forest. However, in contrast to the original analysis, we do not find a significant difference in the effect of Fox News in rural versus urban towns, despite this being the only heterogeneity result that is robust in all specifications in DellaVigna and Kaplan (2007).

In conclusion, our analysis of the HTE of Fox News on Republican voting confirms some of the findings of DellaVigna and Kaplan (2007), namely the presence of heterogeneity along the number of cable channels and no robust heterogeneous effects for districts with different political orientations, but as opposed to the original paper it does not show different effects for urban and rural areas. The analysis with the causal forest further uncovers additional heterogeneity that was previously unexplored, such as a larger effect in towns that experienced a smaller increase in the employment rate, and a larger effect in towns with a lower share of the population with a high school degree.
Finally, including district dummy variables results in the causal forest detecting more heterogeneity in treatment effects compared to the cluster-robust version, both when implementing the overall heterogeneity test and when analysing the HTE in terms of individual covariates. However, the model with district dummy variables could overstate the heterogeneity compared to the cluster-robust forest if the district dummies do not appropriately capture the district-specific effects. This points to the need for a more careful treatment of the issue of clustered observations when employing causal random forests in empirical applications.

DellaVigna and Kaplan (2007) found mixed results for Republican districts in different specifications.
This section focuses on the analysis of HTE for the effect of a teacher training intervention (Loyalka et al., 2019) using the generic machine learning method (Chernozhukov et al., 2018b).
A different causal ML approach to HTE is the generic machine learning method of Chernozhukov et al. (2018b). To make inference possible, the method does not focus directly on the HTEs, but on features of the HTEs, such as: the best linear predictor of the heterogeneous effects (BLP), the group average treatment effects (GATES) sorted by groups induced by machine learning proxies, and the average characteristics of the units in the most and least affected groups, or classification analysis (CLAN). The generic machine learning method is thus useful for empirical work as it: (1) allows detection of heterogeneity in the treatment effect, (2) computes the treatment effect for different groups of observations (such as the least affected or most affected groups), and (3) describes which covariates are most strongly correlated with the heterogeneity.

The approach is based on randomly splitting the data into an auxiliary and a main sample, approximately equal in size. Based on the auxiliary sample, an ML estimator, called the proxy predictor, is constructed for the conditional average treatment effect (CATE). Any generic ML method can be used for this approximation (e.g., elastic net, random forest, neural network, etc.). The proxy predictors are possibly biased, and consistency is not required. We simply take them as approximations and use them to estimate and make inference on features of the CATE. Based on the main sample and the proxy predictors, we can compute the estimates of interest, BLP, GATES and CLAN, and then make inference relying on many splits of the data into auxiliary and main samples.

We give a brief description of how the method works in practice. Let Y be the outcome of interest, D the binary treatment variable, and Z a vector of covariates. Define b(Z) = E[Y(0) | Z], the baseline conditional average, and s(Z) = E[Y(1) | Z] - E[Y(0) | Z], the conditional average treatment effect (CATE). Using the auxiliary sample we obtain ML estimators (or proxy predictors) for the baseline conditional average and the conditional average treatment effect. As mentioned above, these are possibly biased predictors and consistency is not required. Then, for each unit in the main sample, we compute the predicted baseline effects, B(Z), and the predicted treatment effects, S(Z). Note that the predicted treatment effects, S(Z), are obtained as the difference between the predictions of the treatment group model and the control group model. Following the notation of Chernozhukov et al. (2018b), the BLP parameters are obtained using the main sample, by estimating the following regression by weighted OLS, with weights 1/(p(Z)(1 - p(Z))):

Y = a'X_1 + b_1 (D - p(Z)) + b_2 (D - p(Z))(S(Z) - S_bar) + e,   (3)

where X_1 = [1, B(Z)], p(Z) = P[D = 1 | Z] is the propensity score, and S_bar is the average of the predicted treatment effect estimates over the main sample. The control B(Z) is included to improve efficiency. Note that the component (D - p(Z)) is part of the regressor (D - p(Z))(S(Z) - S_bar); it orthogonalizes this regressor with respect to all other covariates that are functions of Z. The coefficient b_1 gives the average treatment effect, while b_2 quantifies how well the proxy predictor approximates the treatment heterogeneity. If b_2 is different from zero, there is heterogeneity in the treatment effects.

Once we obtain the predicted treatment effects, we can divide the observations in the main sample into groups G_1, G_2, ..., G_K, based on their treatment effects.
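The weighted BLP regression (3) can be sketched with simulated data; B and S below are stand-ins for the proxy predictors that would come from the auxiliary sample, and the propensity score is a known constant because the design is randomized.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3000
z = rng.uniform(size=n)
p = np.full(n, 0.5)                              # known propensity score (randomized design)
d = rng.binomial(1, p).astype(float)
y = z + (1 + 2 * z) * d + rng.normal(size=n)     # true CATE: s(z) = 1 + 2z, so ATE = 2

B = z                                            # stand-in proxy for the baseline b(Z)
S = 1 + 2 * z + 0.3 * rng.normal(size=n)         # noisy proxy predictor for s(Z)

# Weighted OLS of (3): regressors [1, B, D - p, (D - p)(S - mean(S))],
# weights 1/(p(1 - p)); b_1 is the ATE, b_2 the heterogeneity loading
X1 = np.column_stack([np.ones(n), B, d - p, (d - p) * (S - S.mean())])
W = 1.0 / (p * (1 - p))
coef = np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (W * y))
b1, b2 = coef[2], coef[3]
```

With a noisy but informative proxy, b_1 recovers the average treatment effect of 2, and b_2 is positive and bounded away from zero, signaling genuine heterogeneity picked up by the proxy.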
In our empirical applications, we choose K = 5, such that group G_1 contains the observations with the lowest 20% of treatment effects and G_5 contains the observations with the highest 20%. Then, using again the main sample, we obtain the sorted group average treatment effects by estimating the weighted regression:

Y = a'X_1 + sum_{k=1}^{K} g_k (D - p(Z)) * 1(G_k) + v,   (4)

where 1(G_k) is an indicator for whether an observation is in group k, and where the weights are the same as in (3). The parameters g_k give the average effect in each group (GATES). Also, if the difference g_K - g_1 is significantly different from zero, we again have evidence of treatment effect heterogeneity between the most and least affected groups.

Lastly, we can analyze the properties or characteristics of the most and least affected groups via classification analysis (CLAN). Let g(Y, Z) be a vector of characteristics of an observation. We can compute the average characteristics of the least and most affected groups, i.e., delta_1 = E[g(Y, Z) | G_1] and delta_K = E[g(Y, Z) | G_K], the parameters of interest being averages of directly observed variables. Similarly to GATES, we can compute and make inference on the difference delta_K - delta_1.

Description of Original Analysis.
We reanalyze a large-scale randomized experiment that investigates the effect of a teacher professional development (PD) program in China on student achievement and on other student and teacher outcomes. The experiment was first studied by Loyalka et al. (2019). Three hundred mathematics teachers, each employed in a different school across one province, took part in the intervention. The teachers were randomly assigned to one of several treatment arms: PD only; PD plus continuous follow-up with additional material and tasks for the trainees; PD plus an evaluation of the extent to which the teachers remembered the content of the training sessions; or no PD (control group). The PD intervention consisted of lectures and discussions.

Randomization was implemented at the school level, and in each school one teacher was nominated to participate in the intervention. The main results are obtained by estimating a cross-sectional regression, where the treatment variable is a dummy indicating the treatment arm that the school was assigned to. The data was collected at three points in time: at baseline, midline and endline. Outcomes are measured at midline or endline, and the main outcome of interest is student math achievement. The control variables include student characteristics, teacher characteristics and class size.

The original paper finds no significant effect of the PD intervention on students' achievement after one academic year, neither for the PD intervention alone, nor for the PD combined with the follow-up and/or the evaluation treatments. The authors also do not find any effect on other outcomes, such as teacher knowledge or student motivation. The lack of effectiveness of the program is attributed to several factors: the content was too theoretical, the PD was delivered passively, and teachers could face constraints in implementing the suggested practices in their schools.
Furthermore, the paper analyzes heterogeneous treatment effects by interacting the treatment variable with a number of student and teacher characteristics: the student's household wealth, baseline achievement level, the amount of training the teacher received prior to the intervention, student and teacher gender, whether the teacher has a college degree, and whether the teacher majored in math. The findings suggest that the effect of the treatment on students' achievement can differ by teacher characteristics; however, no heterogeneous effects are found in terms of student characteristics.
Generic ML Analysis.
We extend the analysis of HTE conducted in the original paper by implementing the generic machine learning method developed by Chernozhukov et al. (2018b). Exploring heterogeneous treatment effects is particularly relevant for this intervention, because a small and insignificant estimate for the ATE could hide significant heterogeneity. Our aim is to dig deeper into the analysis of heterogeneous treatment effects. First, we investigate whether there is significant heterogeneity in treatment effects; second, we analyze whether causal machine learning methods, by implementing a systematic search for heterogeneity across a large number of covariates, can offer additional insights about the characteristics of those who benefited from the program and those who did not, compared to the traditional methods used in the original paper.

As Loyalka et al. (2019) show similar results when estimating the impact of the intervention at midline or endline, we focus on the outcome variables measured at endline. Section A.5 of the Appendix describes the regressions and the control variables.

In our analysis, we focus on the main outcome of interest, i.e. student math achievement. Since the results in the original paper are consistently close to zero when comparing the three different treatment arms with the control group, we choose to only analyze one of the treatment arms, corresponding to the PD intervention plus the evaluation. The sample that we use includes 10,006 students in 201 schools. We follow Loyalka et al. (2019) and cluster standard errors at the school level. In addition to the full set of controls included in the original paper, we also add to our analysis other variables that could be treatment effect modifiers: the baseline values of a number of student-level variables, plus variables indicating teachers' behaviour in the classroom, evaluated by students at baseline.

These additional variables are described in Section A.5 of the Appendix. In Loyalka et al. (2019), the baseline value of the outcome variable is included as a control. Hence, the baseline characteristics described above are not included in all regressions in the original analysis. However, we consider these characteristics as potential drivers of heterogeneity; therefore, we include the baseline values of all available variables in our heterogeneity analysis.

The generic method can be used in conjunction with a range of ML tools, and Chernozhukov et al. (2018b) provide two measures (Best BLP and Best GATES) to compare the performance of the different ML methods used for the estimation of the proxy predictors. We consider the following methods: elastic net, neural network, and random forest. Based on the results of the Best BLP and Best GATES analysis, reported in Table B.9 of the Appendix, we choose to further work with the neural network.

Table 6: Teacher Training - Generic Method: BLP of the CATE

                             (1)              (2)
                             ATE (β1)         HET (β2)
Estimate                     0.002            0.651
90% Confidence Interval      (-0.068, 0.072)  (0.312, 0.990)
p-value                      1.000            0.0003
Observations                 10006            10006

Notes: The estimates are obtained using a neural network to produce the proxy predictor S(Z). The values reported correspond to the medians over 100 splits.
Further details on the Best BLP and GATES measures and on the tuning parameters used in this analysis are discussed in Section A.5.
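To make the mechanics of the method concrete, the sketch below illustrates one split of the generic ML procedure on simulated data: fit proxy predictors on an auxiliary half, then run the BLP regression of the outcome on the demeaned treatment and its interaction with the demeaned proxy on the main half. All variable names, the simulated design, and the choice of random forests as the ML tool are our own illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code) of one split of the generic ML
# BLP step of Chernozhukov et al. (2018b), on simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, k = 2000, 5
Z = rng.normal(size=(n, k))                       # covariates
D = rng.binomial(1, 0.5, size=n)                  # randomized treatment, p = 0.5
cate = Z[:, 0]                                    # true heterogeneous effect
Y = Z[:, 1] + D * cate + rng.normal(size=n)

# 1) Split the sample into an auxiliary half (A) and a main half (M).
aux = rng.permutation(n)[: n // 2]
main = np.setdiff1d(np.arange(n), aux)

# 2) On A, fit proxies: B(Z) for the baseline E[Y | D=0, Z], and S(Z) for
#    the CATE as the difference between the two fitted outcome surfaces.
f0 = RandomForestRegressor(random_state=0).fit(Z[aux][D[aux] == 0], Y[aux][D[aux] == 0])
f1 = RandomForestRegressor(random_state=0).fit(Z[aux][D[aux] == 1], Y[aux][D[aux] == 1])
B = f0.predict(Z[main])
S = f1.predict(Z[main]) - f0.predict(Z[main])

# 3) On M, run OLS of Y on B, S, (D - p) and (D - p)(S - mean(S)); the last
#    two coefficients are beta_1 (ATE) and beta_2 (heterogeneity loading).
#    With p constant at 0.5, the weights 1/(p(1-p)) are a constant and do
#    not change the OLS solution, so they are omitted here.
p = 0.5
X = np.column_stack([np.ones(main.size), B, S,
                     D[main] - p, (D[main] - p) * (S - S.mean())])
beta = np.linalg.lstsq(X, Y[main], rcond=None)[0]
ate_hat, het_hat = beta[3], beta[4]
```

In the paper's implementation this step is repeated over 100 splits, with medians of the estimates and split-adjusted confidence intervals; the sketch shows a single split only.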
[Figure 1: Teacher Training - Generic Method: GATES. Notes: The estimates are obtained using a neural network to produce the proxy predictor S(Z). The point estimates and 90% confidence intervals correspond to the medians over 100 splits.]
We first analyze whether overall heterogeneity in treatment effects can be detected. We present results for the best linear predictor (BLP) of the CATE in Table 6. In line with the original paper, the estimated ATE, given by the coefficient β1, is small (the estimated impact of the PD is 0.002 standard deviations) and not significantly different from zero. The estimated β2 is instead large and significantly different from zero, which indicates that there is heterogeneity in treatment effects. Next, we estimate the group average treatment effects (GATES). We split the sample into five groups, based on the quintiles of the ML proxy predictor S(Z). This analysis reveals further insights into the extent of heterogeneity. Table B.10 of the Appendix reports the GATE in the top and bottom quintiles and shows that the GATE in the top quintile is positive, whereas for the bottom quintile the estimated GATE is negative. Both estimates are statistically significant at the 10% level. The difference between the GATE for the top and the bottom quintile is significant, which confirms the presence of heterogeneity in treatment effects.
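The GATES step just described can be sketched in the same spirit: group observations by quintiles of the proxy S(Z) and estimate one treatment effect per group. The simulated data and names below are illustrative assumptions, not the paper's variables.

```python
# Illustrative GATES sketch: group-specific treatment effects by quintile
# of the proxy predictor S(Z), on simulated data (not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 0.5
S = rng.normal(size=n)                     # ML proxy predictor S(Z)
D = rng.binomial(1, p, size=n)             # randomized treatment
Y = D * S + rng.normal(size=n)             # true effect increases with S

# Quintile groups G1..G5 based on S(Z).
edges = np.quantile(S, [0.2, 0.4, 0.6, 0.8])
group = np.searchsorted(edges, S)          # group index 0..4
G = np.eye(5)[group]                       # group dummies

# OLS of Y on controls plus the group dummies interacted with (D - p):
# the five interaction coefficients are the GATES gamma_1..gamma_5.
X = np.column_stack([np.ones(n), S, G[:, 1:], G * (D - p)[:, None]])
coef = np.linalg.lstsq(X, Y, rcond=None)[0]
gates = coef[-5:]                          # one effect per quintile
```

Here the estimated effects rise across quintiles by construction of the simulated data; in the application, a significant gap between the top and the bottom quintile is the evidence of heterogeneity.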
Additionally, Figure 1 reports the GATES estimates and the 90% confidence intervals for the five quintiles, as well as for the whole sample (the ATE is represented as a blue dashed line, and the confidence interval as two red dashed lines). Notice that for the three middle quintiles the effect of the teacher training intervention is not significantly different from zero.

We then turn to analyzing the possible sources of heterogeneity by implementing the Classification Analysis (CLAN). Thus, we analyze further the top and bottom quintiles in terms of ATE, for which the effect of the PD intervention is positive and negative respectively. In particular, we compare the student and teacher characteristics in the two groups. As a large number of covariates is available, we focus on the ten covariates for which the correlation with the proxy predictor, S(Z), is highest, reported in Table 7. Table B.11 in the Appendix shows the CLAN analysis for the remaining covariates.

Table B.12 reports the correlation of each of the covariates with S(Z).

Table 7: Teacher Training - Generic Method: Classification Analysis

                                  (1)                (2)                 (3)
                                  20% most affected  20% least affected  p-value for the difference
Teacher college degree            0.039              0.800               0.000
                                  (0.019, 0.059)     (0.780, 0.820)
Teacher training hours            2.447              1.684               0.000
                                  (2.399, 2.494)     (1.636, 1.731)
Teacher ranking                   0.666              0.405               0.000
                                  (0.635, 0.697)     (0.374, 0.437)
Student age                       14.18              13.73               0.000
                                  (14.11, 14.25)     (13.65, 13.80)
Teacher experience (years)        16.18              13.16               0.000
                                  (15.60, 16.76)     (12.58, 13.74)
Student female                    0.417              0.555               0.000
                                  (0.385, 0.449)     (0.523, 0.587)
Teacher age                       37.51              35.01               0.000
                                  (37.02, 38.00)     (34.52, 35.50)
Student math score at baseline    -0.029             0.169               0.005
                                  (-0.088, 0.031)    (0.110, 0.229)
Student baseline math anxiety     0.298              -0.219              0.000
                                  (0.236, 0.360)     (-0.281, -0.157)
Class size                        52.87              64.37               0.000
                                  (51.82, 53.93)     (63.32, 65.43)

Notes: This table shows the average value of the teacher and student characteristics for the most and least affected groups. The estimates are obtained using a neural network to produce the proxy predictor S(Z). 90% confidence intervals are reported in parentheses. The variables Student math score at baseline and Student baseline math anxiety are normalized. The values reported correspond to the medians over 100 splits.

We start by analyzing the characteristics of the teachers whose students belong to the least and most affected groups. Interestingly, for the variable indicating whether the teacher majored in math, the direction of the effect is consistent with what was found in the original analysis: the students in the top quintile are more likely to have been taught by a teacher who does not have a major in math, compared to the students in the bottom quintile. It is also interesting to note that the number of hours of training that the teacher received prior to the intervention, which is not found to be a determinant of heterogeneity in the original paper, is higher in the most affected group compared to the least affected group. This may reflect the fact that teachers who have had more training in the past may be able to better implement the suggestions from the PD intervention. Table 7 shows that teacher rank, experience and age are higher in the most affected group compared to the least affected group.
This is consistent with the existence of a similar mechanism: teachers who have more experience may be better able to put the suggested practices into use.

When considering the PD plus follow-up, the authors find a significant negative effect on the scores of students whose teachers majored in math relative to the scores of those whose teachers did not.

The variable indicating teacher training hours prior to the intervention is a categorical variable, based on the terciles of the continuous variable. As the continuous variable is not included in the replication data set of the original paper, for our analysis we use this categorical variable, which takes values 1 to 3, where 3 is the top tercile in the number of training hours.

We then examine whether student characteristics are potential drivers of heterogeneity. In contrast to the findings in Loyalka et al. (2019), who did not find heterogeneity in terms of student features, we find that students in the most affected group differ in terms of several characteristics compared to students in the least affected group. Among the most correlated with the heterogeneity score (listed in Table 7) are student age and gender: students in the most affected group are on average about half a year older than students in the least affected group, and the most affected group includes a larger share of male students. Additionally, students in the most affected group, on average, have a lower baseline math score, and tend to be more anxious about math. Thus, teacher PD could be more beneficial for weaker students, and for students who are more anxious about the subject. Finally, class size appears to be a possible determinant of heterogeneity: students who benefit more from the PD tend to be in smaller classes. This result suggests that in smaller classes it may be easier for teachers to implement some of the practices introduced during the PD training. For instance, Loyalka et al. (2019) mention having students work together in small groups as one of the techniques that were suggested in the PD; this technique is likely to be easier to implement in smaller classes.

In conclusion, our analysis confirms the presence of heterogeneous effects of the teacher PD intervention, and uncovers a rich set of potential determinants of heterogeneity. With the GATES analysis, we are able to show that the achievement of students belonging to the bottom quintile is negatively affected by the intervention, while the achievement of students in the top quintile is positively affected by the intervention. This confirms what was suggested by Loyalka et al. (2019): that there is a group of students who benefit from the intervention, and a group who do not. In addition, the GATES analysis shows that the effect is not significantly different from zero for the students belonging to the middle quintiles. With the CLAN analysis, we can obtain a clearer picture of the characteristics of the groups who do and do not benefit from the intervention, compared to the original HTE analysis. In line with Loyalka et al. (2019), we find that teacher characteristics such as having a college degree or having a major in math are potential determinants of heterogeneity. However, our study uncovers additional differences (that were not identified in the original paper) between the least and the most affected groups, in terms of both teacher and student characteristics, such as teacher's rank, experience, age and number of training hours, as well as student's gender, age, baseline math score, baseline math anxiety and class size.
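The CLAN comparison used above, i.e. averaging covariates within the most and least affected 20% of observations and testing the difference, can be sketched as follows. The data and variable names are simulated illustrations, and the sketch omits the sample splitting and median-over-splits aggregation used in the actual procedure.

```python
# Illustrative CLAN sketch: compare a covariate's mean in the 20% most
# and least affected groups (by the proxy S(Z)). Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5000
age = rng.normal(14.0, 1.0, size=n)            # e.g. student age
S = 0.3 * (age - 14.0) + rng.normal(size=n)    # proxy correlated with age

lo, hi = np.quantile(S, [0.2, 0.8])
least, most = S <= lo, S >= hi                 # bottom / top 20% by S(Z)

mean_most, mean_least = age[most].mean(), age[least].mean()
t_stat, p_val = stats.ttest_ind(age[most], age[least], equal_var=False)
```

A rejected equality of means, as in Table 7, flags the covariate as a potential driver of the detected heterogeneity.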
Conclusion

Our main message is that appropriately combining predictive methods with causal questions adds value to traditional methods and should be more often explored in applied research. We argue that in each revisited study the researcher would have benefited from employing causal ML methods and would have gained additional insights not provided by standard causal inference tools.

When the researcher works with an observational study and is interested in the ATE, causal machine learning methods can improve the credibility of causal analysis by making the unconfoundedness assumption more plausible, as causal ML methods control for potential confounders in a more flexible way; implement a systematic model selection; and are robust approaches for sensitivity analysis. If the researcher is interested in HTE, causal machine learning methods can ensure that relevant heterogeneity and its determinants are not missed, or falsely discovered due to multiple hypothesis testing issues. Also, causal ML methods can be used to uncover heterogeneity ex-post, without being bound to explore HTE only for the specific subgroups indicated in the pre-analysis plan.

Note that even if the empirical study is a randomized control trial and controlling for confounding factors is not necessarily needed, the use of causal machine learning methods can improve efficiency and provide more precise estimates with lower standard errors and tighter confidence intervals.

References

Alberto Alesina, Paola Giuliano, and Nathan Nunn. On the origins of gender roles: Women and the plough.
The Quarterly Journal of Economics, 128(2):469–530, 2013.

S. Athey and G. W. Imbens. Machine learning methods economists should know about. arXiv preprint, 2019.

Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

Susan Athey and Guido W Imbens. The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32, 2017.

Susan Athey and Stefan Wager. Estimating treatment effects with causal forests: An application. arXiv preprint arXiv:1902.07409, 2019.

Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.

Marianne Bertrand, Bruno Crépon, Alicia Marguerie, and Patrick Premand. Contemporaneous and post-program impacts of a public works program: Evidence from Côte d'Ivoire. Working Paper, 2017.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, 2017.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a.

Victor Chernozhukov, Mert Demirer, Esther Duflo, and Ivan Fernandez-Val. Generic machine learning inference on heterogenous treatment effects in randomized experiments. Working Paper, National Bureau of Economic Research, 2018b.

Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with continuous treatments. arXiv preprint arXiv:2004.03036, 2020.

Jonathan Davis and Sara B Heller. Using causal forests to predict treatment heterogeneity: An application to summer jobs. American Economic Review, 107(5):546–50, 2017a.

Jonathan MV Davis and Sara B Heller. Rethinking the benefits of youth employment programs: The heterogeneous effects of summer jobs. Review of Economics and Statistics, pages 1–47, 2017b.

Stefano DellaVigna and Ethan Kaplan. The Fox News effect: Media bias and voting. The Quarterly Journal of Economics, 122(3):1187–1234, 2007.

Tatyana Deryugina, Garth Heutel, Nolan H Miller, David Molitor, and Julian Reif. The mortality and medical costs of air pollution: Evidence from changes in wind direction. American Economic Review, 109(12):4178–4219, 2019.

Simeon Djankov, Tim Ganser, Caralee McLiesh, Rita Ramalho, and Andrei Shleifer. The effect of corporate taxes on investment and entrepreneurship. American Economic Journal: Macroeconomics, 2(3):31–64, 2010.

Ray C Fair. The effect of economic events on votes for president. The Review of Economics and Statistics, pages 159–173, 1978.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. arXiv preprint arXiv:1809.09953, 2018.

Gene M Grossman and Elhanan Helpman. Innovation and Growth in the Global Economy. MIT Press, 1991.

Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.

Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Guido W Imbens and Jeffrey M Wooldridge. Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86, 2009.

Michael C Knaus, Michael Lechner, and Anthony Strittmatter. Heterogeneous employment effects of job search programmes: A machine learning approach. Journal of Human Resources, pages 0718–9615R1, 2020.

Gerald H Kramer. Short-term fluctuations in US voting behavior, 1896–1964. American Political Science Review, 65(1):131–143, 1971.

Michael S Lewis-Beck and Mary Stegmaier. Economic determinants of electoral outcomes. Annual Review of Political Science, 3(1):183–219, 2000.

John A List, Azeem M Shaikh, and Yang Xu. Multiple hypothesis testing in experimental economics. Experimental Economics, pages 1–21, 2016.

Prashant Loyalka, Anna Popova, Guirong Li, and Zhaolei Shi. Does teacher training actually work? Evidence from a large-scale randomized evaluation of a national teacher training program. American Economic Journal: Applied Economics, 11(3):128–54, 2019.

Nathan Nunn. Relationship-specificity, incomplete contracts, and the pattern of trade. The Quarterly Journal of Economics, 122(2):569–600, 2007.

Nathan Nunn and Daniel Trefler. The structure of tariffs and long-term growth. American Economic Journal: Macroeconomics, 2(4):158–94, 2010.

Christopher A Pissarides. British government popularity and economic performance. The Economic Journal, 90(359):569–581, 1980.

Frederic L Pryor. The invention of the plow. Comparative Studies in Society and History, 27(4):727–743, 1985.

Vira Semenova, Matt Goldman, Victor Chernozhukov, and Matt Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2018.

Anthony Strittmatter. What is the value added by using causal machine learning methods in a welfare experiment evaluation? Working Paper, 2019.

Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb):141–158, 2009.

Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

Achim Zeileis, Torsten Hothorn, and Kurt Hornik. Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2):492–514, 2008.
Details on Revisited Studies and Implementation of Causal ML Methods
A.1 The Effect of Corporate Taxes on Investment and Entrepreneurship
Details on the Original Analysis.
In Djankov et al. (2010), the baseline regression equation is the following:

    y_c = α + β taxes_c + X_c Γ + ε_c,

where c is an index for country. Four different outcome variables are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate (measured as a percentage). Three separate measures of corporate taxes are considered. The first is the statutory corporate tax rate, which is the marginal tax rate on income a corporation has to pay assuming the highest tax bracket. The second is the actual first-year corporate income tax liability of a new company, relative to pre-tax earnings. The third is the tax rate which takes into account actual depreciation schedules going five years forward.

The term X_c denotes the control variables, aimed at capturing the effect of potential confounding factors. This is an observational study, in which tax rates are not randomly assigned across countries. It is likely that there will be factors which are correlated with both the treatment (corporate tax rates) and with the outcomes (measures of entrepreneurship and investment). To deal with this issue, the effect of corporate taxes on the outcomes is estimated by adding several control variables to the regressions. The first set of control variables are measures of other taxes: the sum of other taxes payable in the first year of operation, VAT tax, sales tax, and the highest national rate on personal income tax. The second set of covariates includes the logarithm of the number of tax payments made (which is used as a measure of the burden of tax administration), an index of tax evasion, and the number of procedures to start a business. The third set of controls are institutional variables: a property rights index, an indicator of the rigidity of employment laws, a measure of a country's openness to trade, and the log of per capita GDP.
The fourth set of covariates are measures of inflation: average inflation in the previous ten years, and seigniorage, which captures government reliance on printing money.

Details on the DML Analysis.
The results are based on 100 splits and 2 folds. The point estimates are calculated as the median across splits, and the standard errors are adjusted for the variability across sample splits using the median method; see Chernozhukov et al. (2018a).

We use two hybrid ML methods in our analysis. Ensemble is a weighted average of estimates from lasso, boosting, random forest and neural networks, the weights being chosen to give the lowest average mean squared out-of-sample prediction error. Best chooses the best method for estimating the nuisance functions in terms of the average out-of-sample prediction performance among all the other methods.

The lasso estimates are based on ℓ1-penalized regressions with the penalty parameter obtained through 10-fold cross-validation. As controls, for the lasso we consider the set of all raw covariates as well as first-order interactions. For the rest of the ML methods, we consider the set of raw covariates as controls. The regression tree method fits a CART (classification and regression tree) tree with a penalty parameter (which restricts the tree from overfitting and ensures that only splits that are considered "worthy" are implemented) obtained with 10-fold cross-validation. The random forest estimates are obtained using 1000 trees, while the boosting estimates are obtained with 1000 boosted regression trees. For the boosting, the minimum number of observations in the trees' terminal nodes is set to 1 and the bag.fraction parameter is set to 0.5, except for Panel D of Table 1, where it is increased to 0.8. For the neural networks we used 2 neurons and a decay parameter of 0.01; the activation function is set to the linear function.

For the analysis of nonlinear terms with lasso, we examine the estimated nuisance functions for the outcome average entry rate and the treatment variable first-year effective tax rate.
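As an illustration of the cross-fitting and median-aggregation logic described above, the sketch below implements a minimal partialling-out DML estimator on simulated data. The random-forest learner, the number of splits, and all names are our own assumptions for the example, not the settings used in the paper.

```python
# Minimal illustrative DML (partialling-out) sketch with 2-fold
# cross-fitting and a median over sample splits; simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, k, theta = 1000, 10, 1.0
X = rng.normal(size=(n, k))
d = X[:, 0] + rng.normal(size=n)           # treatment depends on a confounder
y = theta * d + X[:, 0] + rng.normal(size=n)

def dml_plr(y, d, X, seed):
    """One sample split: cross-fit the nuisances, then residual-on-residual OLS."""
    res_y, res_d = np.empty(len(y)), np.empty(len(d))
    for train, test in KFold(2, shuffle=True, random_state=seed).split(X):
        g = RandomForestRegressor(random_state=0).fit(X[train], y[train])
        m = RandomForestRegressor(random_state=0).fit(X[train], d[train])
        res_y[test] = y[test] - g.predict(X[test])
        res_d[test] = d[test] - m.predict(X[test])
    return (res_d @ res_y) / (res_d @ res_d)

# Median over sample splits (5 here for speed; 100 in the paper).
theta_hat = np.median([dml_plr(y, d, X, s) for s in range(5)])
```

The per-split estimator regresses the outcome residual on the treatment residual, which is what makes the estimate insensitive to small errors in the two ML nuisance fits.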
In general, the activation function can be set to the linear function for regression problems (when the outcome is continuous) and to the logistic function for classification problems (when the outcome is categorical). For the lasso estimation, depending on the application, other nonlinear terms could be added, such as the squares of the covariates, or three-way interactions.

In our analysis, for the estimation of the two nuisance functions, the lasso selects among the simple covariates and their two-way interactions. It is interesting to note that a large number of interaction terms is selected both in the treatment nuisance function m̂(·) and in the outcome nuisance function ĝ(·). The lasso coefficients are calculated as the median coefficients across splits. Among these, some appear in both nuisance functions (the coefficients of the common terms are depicted in purple in Figure B.1). A particular issue that appears with the lasso when the interest is on analyzing the interaction terms is worth mentioning here. Since the lasso implements regularization by shrinking the smallest coefficients to zero, it is possible that interaction terms are included in the regression while the coefficients of the raw covariates forming the interactions are shrunk to zero. It is thus important to check whether the raw covariates forming these interactions also appear in the regression. If the coefficients on the raw covariates are shrunk, the coefficients of the 'pure' interaction terms might not be properly captured, and the found interaction terms might actually reflect the effect of the raw covariates, diminishing the importance of our uncovered nonlinearities. Thus, when analyzing the relevance of the interaction terms, we are careful to only report the coefficients of the interactions for which both main effects are included in the lasso estimation. The lasso coefficients of all the raw covariates are reported in Table B.1 of the Appendix.

It is important to note here that we do not make inference using the lasso coefficients, but just analyze the magnitude of the coefficients, as a measure of the covariates' importance for predicting the outcome and the treatment variables.
A.2 The Effect of Plough Agriculture on Gender Roles
Details on the Original Analysis.
Alesina et al. (2013) consider several empirical strategies and data sets. They start with OLS regressions performed using country-level and micro-level data. Then, to tackle possible endogeneity issues, the paper follows two approaches: first, several potential confounders are included in the regressions; second, an instrumental variable strategy is used. Our focus is on the country-level regressions.

The instrumental variable strategy is summarized in Section 2.2.2.

The baseline OLS country-level results in the original analysis (reported in Table 4 of Alesina et al., 2013) are obtained by estimating the following equation:

    y_c = α + β plough use_c + X_c^H Γ + X_c^C Π + ε_c,

where c stands for country. In the paper, three outcome variables are examined as measures of gender roles: female labour force participation, attitudes about women's work, and attitudes about women as leaders. The first outcome variable is an indicator variable that equals one if the individual is in the labor force in 2000; the second is the share of firms with a woman among its principal owners in the period 2003-2010; finally, the third is the proportion of seats held by women in the national parliament in 2000. The treatment variable, plough use_c, is calculated as the estimated proportion of individuals living in a country with ancestors that used the plough in pre-industrial agriculture. The vector X_c^H includes historical ethnographic variables at the country level. These controls capture the historical characteristics of ethnicities living in a country, and they are meant to account for differences between ethnicities that historically adopted the plough and those that did not.
They include: ancestral suitability for agriculture, fraction of ancestral land that was tropical or subtropical, ancestral domestication of large animals, ancestral settlement patterns, and ancestral political complexity. The vector X_c^C denotes contemporary country-level controls: the natural log of real per capita GDP, and its square. These are included as the level of economic development is believed to have an impact on female labour force participation, and the square of per capita GDP is intended to capture the observed U-shaped relation between the two variables. Continent fixed effects are also added in some specifications.

The extended set of controls includes additional historical and contemporary controls. Just as with the baseline controls, the additional historical controls are measures of the characteristics of the ancestors of the current population living in a country. These are: the intensity of agriculture; the proportion of subsistence provided by hunting and by the herding of large animals; the fraction of countries' ancestors without land inheritance rules, with patrilocal post-marital residence rules, and with matrilocal post-marital residence rules; the fraction of countries' ancestors with a nuclear and an extended family structure; and the average year the ethnicities were sampled in the Ethnographic Atlas. The contemporary controls are: years of civil and interstate conflicts (1816-2007); terrain ruggedness; whether a country was under a communist regime after WWII; the fraction of a country's population with European descent; oil production per capita; agricultural, manufacturing and services shares of GDP; and the fraction of a country's population who is Catholic, Protestant, other Christian, Muslim, and Hindu. Alesina et al. (2013) provide the rationale for including each of these controls, and details on how the variables are constructed.

The geo-climatic characteristics included in the IV analysis are: terrain slope, soil depth, average temperature, and average precipitation. In the original paper, the geo-climatic characteristics are added linearly, in quadratic forms, and as linear interactions.
Details on the DML Analysis.
As in the previous example, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across the splits, and standard errors adjusted for the variability across sample splits using the median method. The values of the tuning parameters are the same as in the first example.
A.3 The Effect of Skill-Biased Tariffs on Growth
Details on the Original Analysis.
For the country-level results, Nunn and Trefler (2010) estimate the following regression equation:

    ln(y_c,1 / y_c,0) = α + β_SB SBτ_c + X_c β_X + ε_c,

where ln(y_c,1 / y_c,0) is the log annual per capita GDP growth in country c between the beginning and the end of the time period considered, SBτ_c is a measure of the initial skill bias of tariffs, and X_c represents the controls. Three measures of the skill bias of tariffs are used: the first is the correlation between the industry tariffs and the industry's skill intensity, while the second and third are based on the difference between the log average tariffs in skill-intensive industries and the log average tariffs in unskilled-intensive industries (the two measures differ in the choice of the cut-off value for industry skill intensity, with the second using a lower cut-off than the third). The controls include: the log of the initial average level of tariffs in the country, three country characteristics measured at the initial period (the log of GDP per capita, the log of human capital, and the log of the ratio of investment to GDP), cohort fixed effects (to account for the fact that countries have different initial time periods), region fixed effects (accounting for 10 different regions), and two measures of initial production structure (the log of output in skill-intensive and in unskilled-intensive industries separately).

Additionally, Nunn and Trefler (2010) estimate the following regression equation, using industry-level data:

    ln(q_ic,1 / q_ic,0) = β_q ln q_ic,0 + β_τ ln τ_ic,0 + β_E ln τ̄_c + β_SB SBτ_c + X_c β_X + α_i + ε_ic,

where ln(q_ic,1 / q_ic,0) is the average annual log change in industry output in industry i and country c; ln q_ic,0 is the log of industry output in the initial period; ln τ_ic,0 is the log initial-period tariff; ln τ̄_c is the log average tariff; SBτ_c is one of the three measures of the skill bias of tariffs; and α_i are industry fixed effects.
The variable X_c indicates the controls, which are the same as in the country-level regressions.

The original results show a strong, positive correlation between skill-biased tariffs and long-term per capita income growth at the country level (Table 4 in Nunn and Trefler, 2010). The correlation is strong also between the skill bias of tariffs and industry output growth, with and without including the initial industry tariff in the regression (Tables 5 and 6 in Nunn and Trefler, 2010, respectively). The fact that the size of the coefficient of skill-biased tariffs remains large when adding the variable initial industry tariffs suggests that the mechanism highlighted in the model, i.e. skill-biased tariffs shifting resources towards skill-intensive industries, cannot fully account for the correlation between the treatment variable and long-term growth. Nunn and Trefler (2010) further show, with country-level regressions, that the model mechanism can explain up to one quarter of the total correlation between the skill bias of tariffs and long-term growth (Table 7 in the original paper). The paper then investigates other alternative mechanisms that can explain the independent effect of skill-biased tariffs on output growth in Sections V, VI and VII of the original paper.

Details on the DML Analysis.
As in the previous examples, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across splits, and standard errors are adjusted for the variability across sample splits using the median method. The tuning choices are the same as in the previous two examples, except for the neural network in the country-level regressions, where the estimates are obtained using 3 neurons and a decay parameter of 0.001.
A.4 The Effect of Fox News on the Republican Vote Share
Details on the Original Analysis.
To produce the main results (see Table IV in the original paper), the authors estimate the following regression:

v_R,k,j,2000^Pres − v_R,k,j,1996^Pres = β d_k,2000^FOX + Γ X_k,2000 + Γ_− X_k,2000−1990 + Γ_c C_k,2000 + θ_j + ε_k,j,

where k denotes a town in congressional district j. The dependent variable is the change in the Republican vote share between the 1996 and the 2000 presidential elections. The treatment variable d_k,2000^FOX is an indicator variable taking the value of 1 for towns where Fox News was available by the year 2000, and 0 otherwise. The regression includes demographic controls at the town level: total population, the employment rate, the shares of African Americans and of Hispanics, the share of males, the share of the population with some college education, the share of college graduates, the share of high school graduates, the share of the town that is urban, the marriage rate, the unemployment rate, and average income. These controls are added both as levels in 2000 (X_k,2000) and as changes between 1990 and 2000 (X_k,2000−1990), and aim at capturing possible confounders that could be correlated with both the availability of Fox News and voting. In addition to the demographic controls, the regression includes a set of cable system features, denoted by C_k,2000, which are potentially correlated with the treatment variable. These are deciles in the number of channels provided and in the number of potential subscribers. Finally, fixed effects (congressional district fixed effects or county fixed effects), denoted by θ_j, are added to capture trends in voting that might be common to a geographical area and also correlated with Fox News availability. In the original analysis, standard errors are clustered at the cable company level. The paper also tests whether Fox News increased voter turnout and the Republican vote share in the Senate election.

The results from the heterogeneity analysis of DellaVigna and Kaplan (2007) show a negative but insignificant effect for swing districts. Additionally, the authors find that the effect of Fox News on the Republican vote share is significantly smaller in towns where the number of cable channels is higher, suggesting a negative impact of higher competition on the effect of Fox News. Moreover, the effect is found to be significantly larger in more urban areas and smaller in more Republican districts. Regarding the latter two findings, the authors point out that in rural areas and in Republican districts the Republican party tends to have a larger vote base to begin with, thus diminishing the share of voters that could potentially be convinced by Fox News. Out of the four effects, only the differential effect for urban population is significant in both main specifications (county and district fixed effects). The interaction of the treatment variable with the Republican district variable is only significant when including county fixed effects, but not when including district fixed effects, and the opposite is true for the interaction of the treatment with the number of cable channels. The authors also note that they find a smaller effect in the South, but this result is not reported in their paper, and we do not focus on it in our analysis.

Details on the Analysis with the Causal Random Forest.
There are a number of parameters to be set in the causal random forest algorithm, such as the number of trees, the size of the subsample, and the minimum number of control and treatment units in each leaf. The number of trees is typically chosen as a trade-off between computation time and the test error rate. A larger number of trees reduces the Monte Carlo error due to subsampling, which means that the treatment effect predictions will vary less across different forests. A higher minimum number of treatment and control units will lead to bigger leaves and a less deep tree, which will predict less heterogeneity. A smaller minimum will increase the variance, as the treatment effect will be estimated with too few observations in a given leaf. Setting a smaller subsample size will decrease the dependence across trees, but will increase the variance of each estimate in a tree. The sizes of the training and estimation samples are typically fixed to 50% of the drawn subsample. If there are reasons to allocate more observations to one or the other sample, these proportions can be changed. In the algorithm, there is also a standard parameter for the number of covariates considered for a split, chosen before building each tree within the forest. This is often set to ⌊K/3⌋ in the literature, where K is the total number of covariates.

In our analysis, the tuning parameter values are optimized via cross-validation, except for the number of trees, which is set to 2000. We performed sensitivity analysis with different values for the number of trees (1000 and 5000); the results are available upon request.

In the cluster-robust causal forest (Athey and Wager, 2019), when constructing the subsample on which the forest is trained, we do not directly draw observations, but clusters.
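The cluster-level subsampling step can be sketched as follows. This is a toy illustration (the sample sizes and variable names are ours, not the paper's): clusters, rather than individual towns, are drawn into the training subsample, and an observation is flagged out-of-bag when its entire cluster was left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 towns grouped into 40 cable-company clusters (sizes are ours).
n_obs, n_clusters = 200, 40
cluster_id = rng.integers(0, n_clusters, size=n_obs)

# Draw half of the clusters without replacement, then keep every
# observation that belongs to a drawn cluster.
drawn = rng.choice(n_clusters, size=n_clusters // 2, replace=False)
in_subsample = np.isin(cluster_id, drawn)

# An observation is out-of-bag for this tree if and only if its whole
# cluster was left out, so its prediction never uses own-cluster data.
out_of_bag = ~in_subsample
```

Because membership is decided at the cluster level, all observations within a cluster share the same in-sample/out-of-bag status.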
In addition, in the final step, when constructing the predicted out-of-bag treatment effects, an observation is considered out-of-bag if its cluster was not drawn in the subsample.

The variable importance measure reported in Tables B.7 and B.8 takes into account the proportion of splits over all trees for a particular variable, weighted by depth. It is useful for describing which covariates most influence the final estimates when employing the causal forest, as the interpretability of a single tree is lost in this case. Recall from the main text that in the causal forest splits are performed if they maximize a criterion function that rewards splits that increase the variance of the treatment effect across leaves, while penalizing splits that increase the variance within a leaf. Hence, higher values for this measure indicate higher importance in terms of heterogeneity of treatment effects.

This makes random forests different from bagged trees. In bagged trees the number of predictors considered for a split is equal to the total number of covariates the researcher considers, while in random forests the number of predictors is strictly less than this total number. The procedure 'decorrelates' the trees (as the trees will be less similar) and the aggregation of predictions across trees will have a lower variance.

A.5 The Effect of Teacher Training on Student Performance

Details on the Original Analysis.
In Loyalka et al. (2019), the main results are obtained by estimating the following regression equation:

Y_i,j = α_0 + α_1 D_j + X_ij α_2 + τ_k + ε_i,j,

where Y_i,j is the outcome, measured at midline or endline, for student i in school j; D_j is a dummy variable indicating the treatment assignment; the vector X_ij includes the control variables, measured at baseline; and τ_k indicates the block fixed effects. The main outcome of interest, student achievement, is measured with a 35-minute mathematics test at endline. The full set of control variables includes student characteristics (age, gender, parent educational attainment, household wealth), class size, and teacher characteristics (gender, age, experience, education level, rank, a teacher certification dummy, and a dummy indicating whether the teacher majored in math).

The findings from the heterogeneity analysis suggest that the program has a small positive effect on the achievement of students taught by less qualified teachers and a negative effect on students whose teachers are more qualified. In addition, some evidence of heterogeneity is found in terms of whether or not the teachers majored in math, with a negative effect on achievement for those students whose teachers did major in math (this effect is only found when comparing the PD plus follow-up with the control group).
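As a concrete illustration of this kind of specification, the following toy sketch (simulated data; all variable names and coefficient values are ours, not from the study) estimates the treatment coefficient α_1 by OLS with block dummies absorbing the block fixed effects:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data in the spirit of Y_ij = a0 + a1*D_j + X_ij*a2 + block FE + noise.
n, n_blocks = 600, 6
block = rng.integers(0, n_blocks, size=n)
D = rng.integers(0, 2, size=n).astype(float)   # treatment assignment dummy
X = rng.normal(size=n)                         # one baseline control
Y = 0.5 * D + 0.3 * X + 0.1 * block + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, treatment, control, and block dummies
# (first block omitted to avoid collinearity with the intercept).
dummies = (block[:, None] == np.arange(1, n_blocks)).astype(float)
Z = np.column_stack([np.ones(n), D, X, dummies])

coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
alpha1 = coef[1]   # OLS estimate of the treatment effect (true value 0.5 here)
```

Dropping the first block dummy and keeping the intercept is the usual way to avoid the dummy-variable trap.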
Details on the Generic ML Analysis.
In addition to the full set of controls included in the original paper, we add to our analysis the following variables: the baseline values of a number of student-level variables (math self-concept, math anxiety, intrinsic motivation for math, instrumental motivation for math, time spent each week studying math), plus a number of variables indicating teachers' behaviour in the classroom, evaluated by students at baseline (instructional practices of teacher, teacher care, classroom management, and teacher communication).

The schools were randomized within blocks. A block is defined by the year of study the student is enrolled in (i.e. grades 7, 8, or 9) and by the two agencies that implemented the intervention. Hence, the total number of blocks is six.

The p-values are computed based on the median of many random conditional p-values, with the nominal level adjusted for splitting uncertainty. The Best BLP and Best GATES measures are based on maximizing the correlation between the proxy predictor of the conditional average treatment effect, S(Z), and the true conditional treatment effect, s(Z) (see Chernozhukov et al., 2018b). Table B.9 shows that this correlation is the largest for the neural network. Therefore, we carry out the HTE analysis using the neural network. The values of the tuning parameters were optimized via cross-validation for the elastic net and the neural network. For the random forest they are set to default values to save on computation time. For the random forest, the number of trees is set to 2000 and the number of covariates considered for a split is set to K/3, which gives a value of 8.
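The splitting-uncertainty adjustment mentioned above can be sketched in a few lines (the function name is ours): in the generic ML approach of Chernozhukov et al. (2018b), the reported p-value is the median of the split-specific p-values, doubled (equivalently, the nominal level is halved) to account for the randomness of the sample splits.

```python
import numpy as np

def adjusted_p_value(p_values_across_splits):
    """Median of the split-specific p-values, doubled to account for the
    randomness induced by sample splitting (capped at 1)."""
    return min(1.0, 2.0 * float(np.median(p_values_across_splits)))

# Example with p-values from five hypothetical splits:
p_adj = adjusted_p_value([0.01, 0.03, 0.02, 0.08, 0.04])  # median 0.03, doubled
```

The cap at 1 only binds when the median p-value exceeds 0.5.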
Additional Tables and Figures
Table B.1: The Effect of Corporate Taxes on Entrepreneurship: Lasso Coefficients of Raw Covariates

                                              (1)                 (2)
Outcome / Treatment variable:          Average Entry Rate  First-year Effective Tax Rate
Log of number of tax payments              -0.402               0.017
Procedures to start a business             -0.006               0.001
Seigniorage 2004                           -0.003               0
Other taxes                                -0.001               0.003
Rigidity of employment                      0                   0
Average inflation (1995-2004)               0                   0
PIT top marginal rate                       0                   0.010
IEF Property Right Index                    0                   0.001
VAT and sales tax                           0                  -0.005
Tax evasion (GCR)                           0.009              -0.004
Log GDP pc 2003                             0.011              -0.020
EFW Freedom to Trade Internationally Index  0.305              -0.619
Notes:
The table shows the lasso coefficients of the raw covariates, obtained by estimating the nuisance functions g(·) (column 1) and m(·) (column 2). The lasso coefficients are calculated as the median over splits.

Table B.2: On the origins of gender roles: Country-level estimates, partially linear model
Columns: (1) Lasso, (2) Reg. Tree, (3) Boosting, (4) Forest, (5) Neural Net., (6) Ensemble, (7) Best, (8) OLS.
Panels: A. Female labour force participation; B. Share of firms with female ownership; C. Share of political positions held by women. Each panel reports the coefficient on traditional plough use, its standard error, and the number of observations; a final row reports the number of raw covariates. [Numeric entries were lost in extraction.]
Notes: Analysis of Alesina et al. (2013) using DML. Column (8) reports the original paper results. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Robust standard errors are reported in column (8). The number of covariates does not include the treatment variable.

Table B.3: The Structure of Tariffs and Long-Term Growth: Industry-level estimates
Columns: (1) Lasso, (2) Reg. Tree, (3) Boosting, (4) Forest, (5) Neural Net., (6) Ensemble, (7) Best, (8) OLS.
Panels: A. Skill tariff correlation; B. Tariff differential (low cut-off); C. Tariff differential (high cut-off). Each panel reports the coefficient on the corresponding skill-bias measure and its standard error; final rows report the number of observations and of raw covariates. [Numeric entries were lost in extraction.]
Notes: Analysis of Nunn and Trefler (2010) using DML. Column (8) reports the original paper estimates. Standard errors are reported in parentheses. Standard errors adjusted for variability across splits using the median method are reported for the DML estimates. Standard errors adjusted for clustering at the country level are reported in column (8). The number of covariates does not include the treatment variable.

Table B.4: The Structure of Tariffs and Long-Term Growth: Industry-level estimates
Columns, panels, and notes as in Table B.3. [Numeric entries were lost in extraction.]

(1) CATE below median  (2) CATE above median  (3) p-value for the difference
Population, diff. btw. 2000 and 1990 0.00413 (0.00242) 0.00806 (0.00189) 0.20027
Share with high school degree, diff. btw. 2000 and 1990 0.0086 (0.00199) 0.0029 (0.0027) 0.08938
Share with some college, diff. btw. 2000 and 1990 0.00736 (0.00207) 0.0039 (0.00227) 0.26069
Share with college degree, diff. btw. 2000 and 1990 0.00757 (0.00272) 0.00582 (0.00191) 0.59872
Share male, diff. btw. 2000 and 1990 0.00949 (0.00222) 0.0035 (0.00231) 0.06126
Share African American, diff. btw. 2000 and 1990 0.00629 (0.00243) 0.00666 (0.002) 0.90674
Share Hispanic, diff. btw. 2000 and 1990 0.00428 (0.00238) 0.00737 (0.00208) 0.32866
Unemployment rate, diff. btw. 2000 and 1990 0.00366 (0.00238) 0.00866 (0.00224) 0.12612
Married, diff. btw. 2000 and 1990 0.00698 (0.00202) 0.00562 (0.00257) 0.67592
Median income, diff. btw. 2000 and 1990 0.00628 (0.00224) 0.00653 (0.0023) 0.93661
Share urban, diff. btw. 2000 and 1990 0.00517 (0.00203) 0.00945 (0.0025) 0.18368
Population 2000 0.00492 (0.00252) 0.00662 (0.00164) 0.57185
Share with some college 2000 0.00328 (0.00204) 0.00964 (0.00249) 0.04809
Share with college degree 2000 0.00556 (0.00253) 0.00679 (0.00185) 0.6946
Share male 2000 0.0055 (0.00194) 0.00976 (0.00277) 0.20794
Share African American 2000 0.0025 (0.00271) 0.00739 (0.00172) 0.12759
Share Hispanic 2000 0.00136 (0.00225) 0.00799 (0.00217) 0.03386
Employment rate 2000 0.00557 (0.00232) 0.00771 (0.00215) 0.50069
Unemployment rate 2000 0.00541 (0.00214) 0.00741 (0.00235) 0.52906
Share married 2000 0.00683 (0.00228) 0.00585 (0.00229) 0.76121
Median income 2000 0.00501 (0.00218) 0.00712 (0.00223) 0.50006
Share urban 2000 0.00441 (0.0024) 0.00673 (0.0019) 0.44815
No. potential cable subscribers 2000 0.00818 (0.00238) 0.00594 (0.00169) 0.44436
Decile 1 in no. potential cable subscribers 0.00661 (0.0016) -0.00787 (0.01626) 0.37539
Decile 2 in no. potential cable subscribers 0.00664 (0.00165) 0.00084 (0.00861) 0.50799
Decile 3 in no. potential cable subscribers 0.00612 (0.00151) 0.0171 (0.0065) 0.0999
Decile 4 in no. potential cable subscribers 0.00634 (0.0017) 0.0077 (0.00393) 0.75084
Decile 5 in no. potential cable subscribers 0.00667 (0.00174) 0.00357 (0.00371) 0.44915
Decile 6 in no. potential cable subscribers 0.00669 (0.00171) 0.00471 (0.00463) 0.68762
Decile 7 in no. potential cable subscribers 0.00668 (0.0017) 0.00531 (0.00492) 0.79269
Decile 8 in no. potential cable subscribers 0.00758 (0.00168) -0.00131 (0.00405) 0.04239
Decile 9 in no. potential cable subscribers 0.0071 (0.00167) 0.00226 (0.00317) 0.17685
Decile 10 in no. potential cable subscribers 0.0045 (0.00207) 0.01139 (0.00188) 0.01393
No. cable channels available 2000 0.00816 (0.00684) 0.0065 (0.00148) 0.812
Decile 1 in no. cable channels available 0.00645 (0.0017) 0.00149 (0.02648) 0.85167
Decile 2 in no. cable channels available 0.00655 (0.00153) 0.01884 (0.0383) 0.74845
Decile 3 in no. cable channels available 0.00657 (0.00161) 0.00758 (0.01372) 0.94203
Decile 4 in no. cable channels available 0.00747 (0.00155) -0.0101 (0.01332) 0.18996
Decile 5 in no. cable channels available 0.00553 (0.00158) 0.02402 (0.01216) 0.13149
Decile 6 in no. cable channels available 0.00569 (0.00164) 0.01323 (0.00648) 0.25923
Decile 7 in no. cable channels available 0.00585 (0.00192) 0.00953 (0.00253) 0.24565
Decile 8 in no. cable channels available 0.0068 (0.00178) 0.00355 (0.00398) 0.45524
Decile 9 in no. cable channels available 0.00576 (0.00181) 0.01239 (0.00288) 0.05169
Swing district 0.00685 (0.00201) 0.00602 (0.00272) 0.80599
Republican district 0.00693 (0.00187) 0.00084 (0.00264) 0.06021
Notes:
The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the causal random forest that includes district dummy variables. As we are not interested in exploring heterogeneity along the congressional districts, the HTE results for district dummy variables are omitted from the table.

(1) CATE below median  (2) CATE above median  (3) p-value for the difference
Population, diff. btw. 2000 and 1990 0.00357 (0.00398) 0.00829 (0.00299) 0.34201
Share with high school degree, diff. btw. 2000 and 1990 0.0088 (0.00311) 0.00225 (0.00323) 0.14407
Share with some college, diff. btw. 2000 and 1990 0.00809 (0.00281) 0.00194 (0.00431) 0.23202
Share with college degree, diff. btw. 2000 and 1990 0.00709 (0.00317) 0.00604 (0.00329) 0.8194
Share male, diff. btw. 2000 and 1990 0.00975 (0.00356) 0.00308 (0.00268) 0.13407
Share African American, diff. btw. 2000 and 1990 0.00547 (0.00346) 0.007 (0.00298) 0.7364
Share Hispanic, diff. btw. 2000 and 1990 0.00369 (0.00383) 0.00755 (0.00286) 0.41946
Unemployment rate, diff. btw. 2000 and 1990 0.00328 (0.00304) 0.00872 (0.00308) 0.20834
Married, diff. btw. 2000 and 1990 0.00622 (0.00327) 0.00639 (0.00339) 0.97002
Median income, diff. btw. 2000 and 1990 0.0065 (0.00354) 0.00609 (0.00282) 0.92735
Share urban, diff. btw. 2000 and 1990 0.00527 (0.00273) 0.00881 (0.00372) 0.44257
Population 2000 0.00577 (0.00398) 0.00636 (0.0027) 0.9022
Share with some college 2000 0.00532 (0.00321) 0.00785 (0.00376) 0.60916
Share with college degree 2000 0.00545 (0.00296) 0.00672 (0.00318) 0.76975
Share male 2000 0.00459 (0.00259) 0.01138 (0.00529) 0.24942
Share African American 2000 0.00198 (0.00518) 0.00731 (0.00265) 0.35943
Share Hispanic 2000 0.00071 (0.00378) 0.00825 (0.00314) 0.1245
Employment rate 2000 0.0043 (0.00293) 0.00892 (0.00416) 0.36452
Unemployment rate 2000 0.00539 (0.0027) 0.00728 (0.0035) 0.66907
Share married 2000 0.00684 (0.00278) 0.00561 (0.00355) 0.78466
Median income 2000 0.00546 (0.00381) 0.00648 (0.00272) 0.82677
Share urban 2000 0.00534 (0.00404) 0.00647 (0.00276) 0.81683
No. potential cable subscribers 2000 0.00744 (0.00616) 0.00587 (0.00285) 0.81685
Decile 1 in no. potential cable subscribers 0.00653 (0.00264) -0.00486 (0.0162) 0.48767
Decile 2 in no. potential cable subscribers 0.00655 (0.00268) 0.00209 (0.0116) 0.70797
Decile 3 in no. potential cable subscribers 0.00594 (0.00258) 0.01893 (0.0111) 0.25437
Decile 4 in no. potential cable subscribers 0.00628 (0.00256) 0.00734 (0.00928) 0.91234
Decile 5 in no. potential cable subscribers 0.00677 (0.00273) 0.00051 (0.00724) 0.4189
Decile 6 in no. potential cable subscribers 0.00691 (0.00278) 0.00113 (0.00592) 0.37691
Decile 7 in no. potential cable subscribers 0.00685 (0.00241) 0.00351 (0.01161) 0.77827
Decile 8 in no. potential cable subscribers 0.00741 (0.00283) -0.00051 (0.004) 0.10608
Decile 9 in no. potential cable subscribers 0.00683 (0.00294) 0.00274 (0.00409) 0.41683
Decile 10 in no. potential cable subscribers 0.004 (0.00384) 0.01147 (0.00369) 0.16066
No. cable channels available 2000 0.00562 (0.00773) 0.00678 (0.00255) 0.88643
Decile 1 in no. cable channels available 0.00646 (0.00274) -0.00726 (0.03324) 0.68079
Decile 2 in no. cable channels available 0.00644 (0.00265) 0.01768 (0.01293) 0.39453
Decile 3 in no. cable channels available 0.00672 (0.00267) 0.00203 (0.01179) 0.69811
Decile 4 in no. cable channels available 0.00816 (0.00263) -0.01645 (0.01506) 0.10755
Decile 5 in no. cable channels available 0.00484 (0.00251) 0.02998 (0.01806) 0.16774
Decile 6 in no. cable channels available 0.00537 (0.00289) 0.0133 (0.0045) 0.13846
Decile 7 in no. cable channels available 0.00602 (0.00304) 0.00867 (0.0042) 0.60869
Decile 8 in no. cable channels available 0.00675 (0.00291) 0.00327 (0.00503) 0.54915
Decile 9 in no. cable channels available 0.00543 (0.0027) 0.0139 (0.00631) 0.21689
Swing district 0.00634 (0.00308) 0.00736 (0.00515) 0.86489
Republican district 0.00665 (0.00286) 0.00079 (0.00604) 0.38064

Notes:
The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the cluster-robust causal forest.

Variable  Importance (%)
No. cable channels available 2000 6.52
No. potential cable subscribers 2000 5.23
Share employed, diff. btw. 2000 and 1990 4.9
Share African American 2000 4.74
Share married 2000 4.39
Unemployment rate, diff. btw. 2000 and 1990 4.22
Decile 10 in no. cable channels 2000 4.16
Employment rate 2000 3.67
Share with high school degree, diff. btw. 2000 and 1990 3.56
Share with some college 2000 3.55
Population, diff. btw. 2000 and 1990 3.41
Share male, diff. btw. 2000 and 1990 3.38
Share Hispanic, diff. btw. 2000 and 1990 3.27
Median income, diff. btw. 2000 and 1990 3.25
Median income 2000 3.22
Share Hispanic 2000 3.21
Share married, diff. btw. 2000 and 1990 3.07
Share African American, diff. btw. 2000 and 1990 3.02
Population 2000 3.01
Employment rate 2000 2.72
Share with some college, diff. btw. 2000 and 1990 2.67
Share male 2000 2.55
Share with college degree 2000 2.49
Share with college degree, diff. btw. 2000 and 1990 2.23
Share with high school 2000 2.14
Decile 10 in no. potential cable subscribers 2.08
Share urban population, diff. btw. 2000 and 1990 1.9
Decile 7 in no. cable channels available 1.75
Decile 9 in no. cable channels available 1.54
Share of urban population 2000 1.4
Republican district 0.78
Decile 8 in no. cable channels available 0.75
Swing district 0.74
Decile 9 in no. potential cable subscribers 0.36
Decile 8 in no. potential cable subscribers 0.07
Decile 7 in no. potential cable subscribers 0.02
Decile 6 in no. cable channels available 0.01
Notes:
The table reports the importance of each variable obtained from the causal forest with district dummies. Variables with importance lower than 0.01% are omitted.
Variable  Importance (%)
No. cable channels available 2000 10.4
No. potential cable subscribers 2000 8.22
Share with some college 2000 4.79
Unemployment rate, diff. btw. 2000 and 1990 4.35
Decile 9 in no. cable channels 4.16
Decile 10 in no. cable channels 4.09
Employment rate, diff. btw. 2000 and 1990 3.86
Share African American 2000 3.59
Median income 2000 3.39
Population, diff. btw. 2000 and 1990 3.31
Median income, diff. btw. 2000 and 1990 3.1
Share married 2000 2.9
Share male, diff. btw. 2000 and 1990 2.88
Decile 7 in no. cable channels 2.84
Unemployment rate 2000 2.56
Share African American, diff. btw. 2000 and 1990 2.55
Share Hispanic, diff. btw. 2000 and 1990 2.5
Share married, diff. btw. 2000 and 1990 2.4
Share Hispanic 2000 2.38
Share with high school degree, diff. btw. 2000 and 1990 2.28
Share urban, diff. btw. 2000 and 1990 2.23
Share male 2000 2.2
Decile 10 in no. potential cable subscribers 2.19
Population 2000 2.14
Share with some college, diff. btw. 2000 and 1990 2.08
Share with college degree 2000 1.98
Employment rate 2000 1.79
Share with college degree, diff. btw. 2000 and 1990 1.58
Share with high school degree 2000 1.57
Share urban 2000 1.55
Republican district 1.37
Swing district 0.98
Decile 8 in no. cable channels 0.97
Decile 9 in no. potential cable subscribers 0.51
Decile 8 in no. potential cable subscribers 0.21
Decile 7 in no. potential cable subscribers 0.06
Decile 6 in no. cable channels 0.03
Decile 6 in no. potential cable subscribers 0.02
Notes:
The table reports the importance of each variable obtained from the cluster-robust causal forest. Variables with importance lower than 0.01% are omitted.

            (1) Elastic net  (2) Neural network  (3) Random forest
Best BLP         0.012            0.014               0.011
Best GATES       0.115            0.121               0.099
Notes:
The table compares the performance of the three ML methods used to produce the proxy predictors. The performance measures Best BLP and Best GATES are computed as medians over 100 splits.
Table B.10: Teacher Training - Generic Method: GATES of most and least affected groups

                                                  (1) 20% most affected  (2) 20% least affected  (3) Difference
Effect of teacher training on student achievement      0.164                  -0.179                 0.365
90% Confidence Interval                               (0.048, 0.279)         (-0.301, -0.058)       (0.198, 0.533)
p-value                                                0.011                   0.092                 0.001

Notes:
The estimates are obtained using the neural network to produce the proxy predictor S(Z). The values reported correspond to the medians over 100 splits.

(1) 20% most affected  (2) 20% least affected  (3) p-value for the difference
Baseline instructional practices of teacher 0.211 0.053 0.000
(0.149, 0.272) (-0.009, 0.114)
Teacher's baseline classroom management 0.065 0.074 0.000
(0.003, 0.127) (0.012, 0.136)
Teacher gender 0.529 0.536 1.000
(0.497, 0.562) (0.504, 0.568)
Teacher certification dummy 0.970 1.000 0.000
(0.962, 0.977) (0.992, 1.008)
Student's baseline instrumental motivation for math -0.111 0.224 0.000
(-0.173, -0.050) (0.162, 0.285)
Student's baseline time spent each week studying math -0.073 0.204 0.000
(-0.142, -0.004) (0.135, 0.273)
Student's baseline math self-concept -0.380 0.317 0.000
(-0.441, -0.319) (0.256, 0.378)
Teacher majored in math 0.309 0.507 0.000
(0.278, 0.340) (0.475, 0.538)
Mother education level 0.570 0.425 0.000
(0.538, 0.602) (0.393, 0.457)
Teacher's baseline communication -0.088 0.251 0.000
(-0.152, -0.025) (0.187, 0.314)
Student's baseline intrinsic motivation for math -0.294 0.299 0.000
(-0.356, -0.232) (0.237, 0.362)
Baseline teacher care -0.165 0.311 0.000
(-0.228, -0.103) (0.249, 0.374)
Household asset index -0.421 0.240 0.000
(-0.491, -0.351) (0.170, 0.310)
Father education level 0.583 0.589 1.000
(0.551, 0.614) (0.557, 0.621)

Notes:
The table shows the average value of the teacher and student characteristics for the most and least affected groups. The estimates are obtained using the neural network to produce the proxy predictor S(Z). Confidence intervals with 90% nominal level are reported in parentheses. All variables, except Teacher gender, Teacher certification dummy, Teacher majored in math, Mother education level and Father education level, are normalized. The values reported correspond to the medians over 100 splits.

Variable  Correlation with S(Z)
Teacher college degree -0.237
Teacher training hours 0.125
Teacher ranking 0.116
Student age 0.111
Teacher experience (years) 0.101
Student female -0.094
Teacher age 0.089
Math score at baseline (normalized) 0.075
Student baseline math anxiety 0.063
Class size -0.060
Baseline instructional practices of teacher 0.053
Teacher's baseline classroom management 0.051
Teacher gender -0.045
Teacher certification dummy 0.036
Student's baseline instrumental motivation for math 0.025
Student's baseline time spent each week studying math 0.022
Student's baseline math self-concept -0.021
Teacher majored in math -0.016
Mother education level 0.009
Teacher's baseline communication -0.008
Student's baseline intrinsic motivation for math 0.008
Baseline teacher care -0.006
Household asset index -0.005
Father education level -0.003
Notes:
The table reports the correlation of each covariate with the proxy predictor S(Z).

Figure 2.1: m̂(·). Figure 2.2: ĝ(·).

Notes:
The figure plots the seven largest lasso coefficients of the interaction terms, obtained by estimating the nuisance functions m(·) and g(·). Colons indicate interactions of variables. The treatment variable D is the first-year effective corporate tax rate. The dependent variable Y is the average entry rate. The lasso coefficients are calculated as the median over splits.