Model selection for estimation of causal parameters

Dominik Rothenhäusler, Stanford University

September 1, 2020
Abstract
A popular technique for selecting and tuning machine learning estimators is cross-validation. Cross-validation evaluates overall model fit, usually in terms of predictive accuracy. This may lead to models that exhibit good overall predictive accuracy, but can be suboptimal for estimating causal quantities such as the average treatment effect. We propose a model selection procedure that estimates the mean-squared error of a one-dimensional estimator. The procedure relies on knowing an asymptotically unbiased estimator of the parameter of interest. Under regularity conditions, we show that the proposed criterion has asymptotically equal or lower variance than competing procedures based on sample splitting. In the literature, model selection is often used to choose among models for nuisance parameters, but the identification strategy is usually fixed across models. Here, we use model selection to select among estimators that correspond to different estimands. More specifically, we use model selection to shrink between methods such as augmented inverse probability weighting, regression adjustment, the instrumental variables approach, and difference-in-means. The performance of the approach for estimation and inference for average treatment effects is evaluated on simulated data sets, including experimental data, instrumental variables settings and observational data with selection on observables.
Model selection is a fundamental task in statistical practice. Usually, the aim is to find a model that optimizes overall model fit. If the loss function is quadratic, the total error can be decomposed into the error due to variance and the error due to bias. A popular technique to balance the bias-variance trade-off in the context of prediction is cross-validation. However, models that have high predictive accuracy do not always yield accurate parameter estimates. In the literature, there exist some solutions for parameter-specific model selection, but we currently lack a reliable general-purpose tool. In the context of causal inference, a reliable parameter-specific model selection tool could enable the following applications.

Estimating average treatment effects from observational data can be unreliable if conditional treatment assignment probabilities are close to zero or one. To address this issue, some applied researchers "move the goalpost" by switching to causal contrasts that can be estimated more reliably; see LaLonde (1986), Heckman et al. (1998), Crump et al. (2006) and references therein. For example, instead of estimating the average treatment effect (ATE), a practitioner might switch to estimating the average treatment effect on the treated (ATT), or other causal contrasts such as the overlap effect. This is often done in an ad-hoc fashion. Using a model selection tool in this context would allow practitioners to trade off bias and variance when switching from estimating the ATE to estimators of other causal contrasts.
Combining evidence across data sets in the context of causal inference has recently attracted increasing interest (Peters et al., 2016; Bareinboim and Pearl, 2016; Athey et al., 2020). However, data quality often varies from data set to data set. In this case, using all data sets can lead to untrustworthy, biased estimators. A reliable model selection tool could make it possible to distinguish which data sets and estimators are useful for solving the estimation problem at hand, and which are not.
Researchers are often uncomfortable with using statistical procedures that only work under strong assumptions. Using such methods over other procedures may introduce some bias if the assumptions are violated, but has the potential to reduce variance. For example, conditional independence assumptions can be leveraged to improve precision of treatment effect estimators (Athey et al., 2019; Guo and Perković, 2020). A model selection tool as described above would allow researchers to systematically trade off bias and variance when switching between estimation procedures that work under different sets of assumptions. Doing so can potentially improve precision in scenarios where researchers are not comfortable with making strong assumptions.
Model selection has a long history in statistics and machine learning. For optimizing loss-based estimators, the most commonly used methods include cross-validation, the Akaike information criterion, and the Bayesian information criterion (Akaike, 1974; Schwarz, 1978; Friedman et al., 2001; Arlot and Celisse, 2010).

The focused information criterion is a model selection criterion which, for a given focus parameter, estimates the mean-squared error of submodels (Claeskens and Hjort, 2003, 2008). It relies on knowing an asymptotically unbiased estimator of the parameter of interest. Its theoretical justification is given in a local misspecification framework.

More recently, Cui and Tchetgen-Tchetgen (2019) introduced a model selection tool for finite-dimensional functionals in a semiparametric model when a doubly robust estimating function is available. It is based on a pseudo-risk criterion that has a robustness property if one of the estimators is biased.

For the task of model selection when estimating heterogeneous treatment effects, several methods have been developed (Kapelner et al., 2014; Rolling and Yang, 2014; Athey and Imbens, 2016; Nie and Wager, 2017; Zhao et al., 2017; Powers et al., 2018). Most of these methodologies are specific to the considered model class. A comparison of this line of work for individual treatment effects can be found in Schuler et al. (2018).

Van der Laan and Robins (2003) propose a loss-based approach for parameter-specific model selection. In this work, the authors recommend minimizing an empirical estimate of the overall risk R(θ̂_g, η̂), where θ̂_g, g = 1, . . . , G are candidate estimators and η̂ is an efficient estimator of the nuisance parameter, computed on the training data set. Our approach is more generic in the sense that we do not assume that the parameter of interest minimizes a known loss function.

Closest to our work is the cross-validation criterion developed by Brookhart and Van Der Laan (2006). It proceeds as follows.
The data D = (D_1, . . . , D_n) is randomly split into K disjoint, roughly equally-sized folds D_{2,1}, . . . , D_{2,K}. Define D_{1,k} = D \ D_{2,k}. If the data is i.i.d., D_{1,k} and D_{2,k} are independent for each k. Let θ̂_0 be an unbiased estimator of the parameter of interest θ_0, and let θ̂_g, g = 0, . . . , G, be candidate estimators. Then, we can compute the risk criterion

    R̃(g) = (1/K) Σ_{k=1}^K (θ̂_g(D_{1,k}) − θ̂_0(D_{2,k}))².    (1)

Using independence of D_{1,k} and D_{2,k},

    E[R̃(g)] = E[(θ̂_g(D_{1,1}) − θ_0)²] + Var(θ̂_0(D_{2,1})).

As Var(θ̂_0(D_{2,1})) is constant in g, the criterion in equation (1) can be used to select an estimator θ̂_g with low mean-squared error for estimating θ_0 among θ̂_g, g = 0, . . . , G. The criterion in equation (1) is attractive as it is simple and widely applicable. We will compare the proposed model selection criterion to the criterion in equation (1), both in theoretical results and in simulations.

In this paper, we work towards making parameter-specific model selection more reliable. We derive a model selection criterion that estimates the mean-squared error of a one-dimensional estimator. We show that the selected model has equal or lower variance than the baseline estimator asymptotically. Compared to the parameter-specific cross-validation approach developed by Brookhart and Van Der Laan (2006), we show that the proposed criterion has equal or lower variance asymptotically. The proposed criterion is flexible in the sense that it can be used for any one-dimensional estimator in both parametric and semi-parametric settings. Theoretical justification of the method is given under the assumption of asymptotic linearity.

In previous work, the goal of model selection for one-dimensional estimation is usually to select a nuisance parameter model from a set of candidate models, but the identification strategy is held fixed across models (Van der Laan and Robins, 2003; Brookhart and Van Der Laan, 2006; Cui and Tchetgen-Tchetgen, 2019).
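As an illustration, the K-fold criterion in equation (1) can be sketched in Python as follows. This is our own sketch, not the paper's reference implementation; the function and estimator names are illustrative.

```python
import numpy as np

def cv_criterion(data, theta_0, theta_g, K=10, seed=0):
    """K-fold criterion of equation (1): average squared difference between
    the candidate estimator on the training folds and the unbiased baseline
    estimator on the held-out folds."""
    rng = np.random.default_rng(seed)
    n = len(data)
    folds = np.array_split(rng.permutation(n), K)
    vals = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # candidate on training part, unbiased baseline on held-out part
        vals.append((theta_g(data[train_idx]) - theta_0(data[test_idx])) ** 2)
    return float(np.mean(vals))
```

For example, with the sample mean as the unbiased baseline, a heavily biased candidate yields a much larger criterion value than the mean itself, so the criterion ranks the candidates as intended.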
In this paper, the goal is to select among different estimands, or identification strategies. Mathematically, this corresponds to selecting among different functionals of the underlying data generating distribution. In some situations, this results in dramatic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed n, model selection can lead to increased risk in parts of the parameter space.

In simulations, we demonstrate that the proposed criterion allows us to trade off bias and variance in a variety of scenarios, including experiments, instrumental variables settings and data with selection on observables. The code can be found at github.com/rothenhaeusler/tms.

In Section 2, we introduce a method for parameter-specific model selection and discuss applications in causal inference. Theory for the method is discussed in Section 3. We evaluate the performance of the proposed procedure on simulated data in Section 4.
This section consists of two parts. We briefly discuss the setting in Section 2.1. Then, we introduce the method in Section 2.2 and discuss basic properties.
We observe data D = (D_i, i = 1, . . . , n), where the D_i are independently drawn from some unknown distribution P. Suppose we have access to estimators θ̂_g(D), g = 0, . . . , G, of some unknown parameter θ_0. In the following, to simplify notation, we will write θ̂_g instead of θ̂_g(D). We assume that the baseline estimator θ̂_0 is asymptotically unbiased for θ_0, i.e. that E[θ̂_0] = θ_0 + o(n^(−1/2)). In addition, for g ≠ 0, we assume that E[θ̂_g] = θ_g + o(n^(−1/2)) for some unknown θ_g.

2.2 The method

We aim to find the estimator g that minimizes

    R(g) = E[(θ̂_g − θ_0)²].    (2)

Here and in the following, we suppress the dependence of R(g) on n. As bias and variance of θ̂_g are unknown, the function R(g) is unknown and one has to minimize a proxy of the risk R(g) instead. We propose to estimate R(g) in equation (2) via

    R̂(g) = (θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0) + V̂ar(θ̂_g),    (3)

where V̂ar(θ̂_g − θ̂_0) is an estimator of the variance of θ̂_g − θ̂_0 and V̂ar(θ̂_g) is an estimator of the variance of θ̂_g. If explicit expressions for variance estimators of θ̂_g − θ̂_0 or θ̂_g are not available, we recommend estimating the variances via the bootstrap. We propose to choose a final estimate θ̂_ĝ by solving

    ĝ = arg min_g R̂(g).    (4)

In the following, we will sketch the theoretical justification for the approach in equation (4). The main reason to use the risk estimator in equation (3) over sample-splitting based methods is the potential reduction in asymptotic variance. We will discuss this further in Section 3. The method is evaluated on simulated data sets in Section 4. Under regularity conditions, we have

    E[V̂ar(θ̂_g − θ̂_0) + V̂ar(θ̂_g)] = Var(θ̂_g − θ̂_0) + Var(θ̂_g) + o(1/n).    (5)

If θ_g = θ_0, combining equation (3) with equation (5) yields

    E[R̂(g)] = Var(θ̂_g − θ̂_0) − Var(θ̂_g − θ̂_0) + Var(θ̂_g) + o(1/n)
             = E[(θ̂_g − θ_0)²] + o(1/n)
             = R(g) + o(1/n).
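The criterion in equation (3), with both variance terms estimated via the bootstrap as recommended above, might be sketched as follows. The names and defaults are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def bootstrap_var(data, estimator, B=200, seed=0):
    """Nonparametric bootstrap estimate of the variance of an estimator."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [estimator(data[rng.integers(0, n, n)]) for _ in range(B)]
    return float(np.var(reps, ddof=1))

def risk_criterion(data, theta_0, theta_g, B=200, seed=0):
    """Risk estimate of equation (3):
    (theta_g - theta_0)^2 - Var_hat(theta_g - theta_0) + Var_hat(theta_g)."""
    diff = lambda d: theta_g(d) - theta_0(d)
    return (diff(data) ** 2
            - bootstrap_var(data, diff, B, seed)
            + bootstrap_var(data, theta_g, B, seed))
```

For an estimator with bias b, the first two terms together estimate b², so a strongly biased candidate receives a risk estimate near b² while an unbiased candidate's risk estimate is dominated by its variance.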
Note that in this case, under regularity conditions, usually R(g) = O(1/n). Similarly, if θ_g ≠ θ_0, we obtain

    E[R̂(g)] = R(g) + o(1/√n).

Under regularity conditions, in this case usually R(g) = O(1). Thus, we have seen that equation (3) is approximately unbiased for the mean-squared error of θ̂_g if the approximation in equation (5) holds. As discussed in the sketch, (θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0) is an approximately unbiased estimator of the squared bias (θ_g − θ_0)². As we know that the squared bias is nonnegative, we can improve precision by defining the modified risk criterion

    R̂_mod(g) = ((θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0))_+ + V̂ar(θ̂_g).    (6)

Then the final estimator θ̂_ĝ is chosen such that ĝ minimizes equation (6). If there are ties, we select ĝ as the one that minimizes (θ̂_g − θ̂_0)² among the g that satisfy R̂_mod(g) = min_g R̂_mod(g). The criterion R̂_mod(g) is not asymptotically unbiased for R(g), but has some favorable statistical properties that we will discuss in the following section.

In this section we discuss the theoretical underpinnings and computational issues of the method introduced in Section 2. First, we show that the criterion R̂(g) is asymptotically unbiased for estimating the mean-squared error R(g) = E[(θ̂_g − θ_0)²]. Secondly, we compare the criterion R̂(g) to cross-validation in terms of asymptotic bias and variance. Then, we discuss the asymptotic risk of the resulting estimator. Finally, we derive a computational shortcut for forming approximate bootstrap confidence intervals.

We have to make some assumptions that guarantee that the estimators are asymptotically well-behaved. We make two major assumptions, in addition to the assumptions outlined in Section 2.1. The first major assumption is a slightly stronger version of asymptotic linearity.
Asymptotic linearity is an assumption that is commonly made to justify asymptotic normality of an estimator (Van der Vaart, 2000; Tsiatis, 2007). As our goal is to estimate the mean-squared error of an estimator, we use a slightly stronger version that guarantees convergence of second moments. The second major assumption is that the variance estimates are consistent.
Assumption 1.
We make two major assumptions.

1. Let θ̂_g, g = 0, . . . , G be estimators such that

       θ̂_g − θ_g = (1/n) Σ_{i=1}^n ψ_g(D_i) + e_g(n),

   where ψ_g(D_i) is centered and has finite nonzero second moments, and E[e_g(n)²] = o(1/n). To avoid trivial special cases, in addition we assume |Cor(ψ_g(D), ψ_0(D))| ≠ 1.

2. The estimators of variance are consistent, that means

       n·V̂ar(θ̂_g − θ̂_0) − lim_n n·Var(θ̂_g − θ̂_0) = o_P(1),
       n·V̂ar(θ̂_g) − lim_n n·Var(θ̂_g) = o_P(1).

Asymptotic linearity requires e_g(n) = o_P(1/√n), while we assume that E[e_g(n)²] = o(1/n). Thus, our assumption is slightly stronger than asymptotic linearity. Let us now turn to our main theoretical results.

Our first result shows that the proposed criterion is asymptotically unbiased for the mean-squared error of θ̂_g. The convergence rate depends on whether θ̂_g is asymptotically biased for estimating θ_0. If the estimator θ̂_g is unbiased, the convergence rate is faster than if the estimator is biased. The proof of the following result can be found in the Appendix.

Theorem 1 (Asymptotic unbiasedness of R̂(g)). Let Assumption 1 hold.

1. If θ_g = θ_0, then

       n(R̂(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

2. If θ_g ≠ θ_0, then

       √n(R̂(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

This result shows that the estimator is asymptotically unbiased, which is important for the theoretical justification of the approach. The major strength of the approach will become apparent in the next section, where we compare asymptotic bias and variance to the cross-validation criterion (1).

In this section, we will show that the proposed criterion has asymptotically lower variance than the cross-validation criterion (1) if θ_g = θ_0. Furthermore, we show that cross-validation is generally biased for estimating the mean-squared error if θ_g = θ_0. Let us first discuss asymptotic bias of the naive approach.
The short proof of this result can be found in the Appendix.

Theorem 2 (Asymptotic biasedness of R̃(g)). Let Assumption 1 hold and fix K.

1. If θ_g = θ_0 and Var(ψ_g(D)) ≠ Var(ψ_0(D)), then

       n(R̃(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with nonzero mean, where the mean depends on g.

2. If θ_g ≠ θ_0, then

       √n(R̃(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

Comparing this result with Theorem 1, we see a major difference in how the methods behave when θ_g = θ_0. The proposed criterion is asymptotically unbiased for the mean-squared error, while cross-validation is not. The cross-validation approach computes the estimator on randomly chosen subsets of the data. Thus, intuitively, this approach overestimates the asymptotic variance of the estimator. This defect of cross-validation could be solved by subtracting a correction term.

Now, let us turn to the asymptotic variance of the model selection criteria. The proof of the following result can be found in the Appendix.

Theorem 3 (Asymptotic variance of model selection criteria). Let Assumption 1 hold and fix K.

1. If θ_0 = θ_g, the asymptotic variance of n(R̂(g) − E[(θ̂_g − θ_0)²]) is strictly lower than the asymptotic variance of n(R̃(g) − E[(θ̂_g − θ_0)²]).

2. If θ_0 ≠ θ_g, the asymptotic variance of √n(R̂(g) − E[(θ̂_g − θ_0)²]) is equal to the asymptotic variance of √n(R̃(g) − E[(θ̂_g − θ_0)²]).

Roughly speaking, this theorem shows that the proposed criterion R̂(g) has equal or lower asymptotic variance than the cross-validation criterion R̃(g). The difference in variance leads to substantially different resulting risk in some scenarios, as we will see in the simulation section. The difference in variance is particularly large if the splits are imbalanced, i.e. if the test data sets have much smaller or much larger sample size than the training data sets.
Intuitively, if the validation data sets have small sample size, validation becomes unstable. However, if the validation data set is large, the estimator on the training data sets becomes unstable. This tradeoff can be avoided by estimating bias and variance separately, as done in the proposed approach.

Here and in the following, we focus on the case where the number of models G is small and fixed and n → ∞. We are focusing on the case where G is small, as we expect the method to be most reliable in scenarios where G is small. This will be further discussed in Section 3.5. First, we investigate the asymptotic behaviour of the proposed procedure in the case where the number of models is fixed and n → ∞. The proof of the following result can be found in the Appendix.

Corollary 1 (Asymptotic risk of selected model). Let Assumption 1 hold. Consider a finite and fixed number of estimators g = 0, . . . , G. Let

    ĝ = arg min_g R̂_mod(g).

For n → ∞,

    P[θ_ĝ = θ_0] → 1, and P[R(ĝ) ≤ R(0)] → 1.

In words, for n → ∞, the proposed method selects models with lower or equal risk than the baseline estimator θ̂_0. Interestingly, an analogous result does not hold for the cross-validation procedure (1). Even for n → ∞, the model selected by cross-validation may have larger risk than the baseline procedure θ̂_0 with positive probability. Note that Corollary 1 does not state that the procedure selects the optimal model asymptotically. This is indeed not true. It is straightforward to construct a model selection procedure that selects the optimal model asymptotically by placing additional weight on the variance term. For example, one might consider R̂_alt(g) = R̂_mod(g) + √n·V̂ar(θ̂_g). However, in our experience, placing additional weight on the variance term can lead to problematic performance compared to the baseline estimator θ̂_0 in finite samples and thus is not recommended.
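A minimal sketch of targeted selection with the modified criterion in equation (6), including the tie-breaking rule from Section 2.2, could look as follows. This is a simplified illustration under our own conventions (shared bootstrap indices across estimators, illustrative names), not the reference implementation.

```python
import numpy as np

def select_model(data, estimators, B=200, seed=0):
    """Targeted selection via equation (6):
    R_mod(g) = max((theta_g - theta_0)^2 - Var_hat(theta_g - theta_0), 0)
               + Var_hat(theta_g),
    where estimators[0] is the asymptotically unbiased baseline.
    Ties are broken by the smaller squared contrast with the baseline."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, (B, n))  # shared bootstrap indices
    reps = np.array([[est(data[i]) for est in estimators] for i in idx])
    point = np.array([est(data) for est in estimators])
    var_diff = np.var(reps - reps[:, [0]], axis=0, ddof=1)
    var_g = np.var(reps, axis=0, ddof=1)
    sq_contrast = (point - point[0]) ** 2
    r_mod = np.maximum(sq_contrast - var_diff, 0.0) + var_g
    # lexsort: primary key r_mod, secondary key squared contrast
    order = np.lexsort((sq_contrast, r_mod))
    g_hat = int(order[0])
    return g_hat, point[g_hat]
```

In a toy example with an unbiased baseline, a strongly biased candidate, and an unbiased but noisier candidate, the strongly biased candidate receives a large estimated squared bias and is essentially never selected.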
Deriving confidence intervals that are valid in conjunction with a model selection step is a challenging topic and has attracted substantial interest in recent years; see for example Berk et al. (2013); Taylor and Tibshirani (2015). Generally speaking, statistical inference after a model selection step can be unreliable if the uncertainty induced by the model selection step is ignored.

Thus, in settings where the number of estimators G is small and n is large, we recommend forming confidence intervals by bootstrapping the model selection procedure. More specifically, we recommend creating bootstrap samples D*_1, . . . , D*_B. Let ĝ(D*_b) = arg min_g R̂_mod(g), where R̂_mod(g) is computed on the bootstrap sample D*_b. Let θ̂_g(D*_b) be the estimator θ̂_g, computed on the data set D*_b. Then we can define the final estimate θ̂*_b = θ̂_{ĝ(D*_b)}(D*_b) for all b = 1, . . . , B. The lower and upper limit of a 95% confidence interval are then the empirical 2.5% quantile and the empirical 97.5% quantile of θ̂*_b, b = 1, . . . , B, respectively.

Bootstrapping the procedure in Section 2.2 can be computationally expensive, especially when Var(θ̂_0 − θ̂_g) and Var(θ̂_g) are estimated via the bootstrap. There is a computational shortcut such that bootstrap confidence intervals can be formed at virtually no additional computational cost. Note that the two variance terms Var(θ̂_0 − θ̂_g) and Var(θ̂_g) can often be estimated at rate o_P(1/n). In this case, asymptotically, the dominant source of uncertainty in R̂_mod(g) is estimation of (θ_g − θ_0)². Hence, we can keep the estimators of the variances V̂ar(θ̂_0 − θ̂_g) and V̂ar(θ̂_g) fixed across bootstrap samples. More specifically, on each bootstrap sample b = 1, . . . , B we propose to minimize the risk criterion

    R̂_mod,shortcut,b(g) = ((θ̂_g(D*_b) − θ̂_0(D*_b))² − V̂ar(θ̂_0 − θ̂_g))_+ + V̂ar(θ̂_g),

where

    V̂ar(θ̂_0 − θ̂_g) = 1/(B−1) Σ_{b=1}^B ( θ̂_0(D*_b) − θ̂_g(D*_b) − (1/B) Σ_{b'} (θ̂_0(D*_{b'}) − θ̂_g(D*_{b'})) )²,

    V̂ar(θ̂_g) = 1/(B−1) Σ_{b=1}^B ( θ̂_g(D*_b) − (1/B) Σ_{b'} θ̂_g(D*_{b'}) )².

Note that these terms only depend on B evaluations of θ̂_g(·) for g = 0, . . . , G, which drastically speeds up computation compared to naive computation of the intervals. We evaluate the actual coverage of intervals formed via this simplified bootstrap procedure in Section 4.

Note that the confidence intervals described above are only expected to be valid in the large-sample limit and for a small number of estimators G. For small sample size, the selection procedure might select a biased estimator with high probability. In this case, bootstrap confidence intervals are not reliable in general. If the procedure is tasked to select among a large number of estimators, a selective inference approach might be appropriate (Taylor and Tibshirani, 2015).
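The shortcut can be sketched as follows: the variance terms are computed once from the bootstrap replicates and then held fixed while only the selection step is re-run on each bootstrap sample. This is an illustrative sketch under our own naming, not the reference implementation.

```python
import numpy as np

def bootstrap_ci(data, estimators, B=500, seed=0, alpha=0.05):
    """Percentile bootstrap interval for the selected estimator, using the
    shortcut: variance terms are estimated once from the bootstrap replicates
    and kept fixed across bootstrap samples; estimators[0] is the baseline."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, (B, n))
    reps = np.array([[est(data[i]) for est in estimators] for i in idx])
    var_diff = np.var(reps - reps[:, [0]], axis=0, ddof=1)  # Var(theta_0 - theta_g)
    var_g = np.var(reps, axis=0, ddof=1)                    # Var(theta_g)
    # re-run only the selection step on each bootstrap sample
    r_mod = np.maximum((reps - reps[:, [0]]) ** 2 - var_diff, 0.0) + var_g
    theta_star = reps[np.arange(B), np.argmin(r_mod, axis=1)]
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The only per-bootstrap work is evaluating each θ̂_g once, which is exactly the B evaluations mentioned above.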
In this section, we discuss applications of the proposed methodology. Compared to the existing literature on model selection for causal effects, instead of selecting among nuisance parameter models, we consider shrinking between different functionals of the data generating distribution. As we will see, doing so can lead to drastic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed n, model selection can lead to increased risk in parts of the parameter space.

In the following we will use potential outcomes to define causal effects. We are interested in the causal effect of a treatment T ∈ {0, 1} on an outcome Y. We use the potential outcome framework (Rubin, 1974; Splawa-Neyman et al., 1990). Let Y(1) denote the potential outcome under treatment T = 1 and Y(0) the potential outcome under treatment T = 0. We assume a superpopulation model, i.e. Y(1) and Y(0) are random variables. In the following, the goal is to estimate the average treatment effect

    θ_0 = E[Y(1) − Y(0)].    (7)

Many methods have been designed to estimate (7) and these methods operate under a variety of assumptions. We present several applications that are based on different sets of assumptions for identifying (7). In each of the cases, we compare the proposed method (6), termed "targeted selection", with the cross-validation procedure (1) and with a baseline estimator. The code can be found at github.com/rothenhaeusler/tms.

In observational studies, it is common practice to estimate causal effects under the assumption of unconfoundedness and under the overlap assumption. Roughly speaking, the overlap assumption states that treatment assignment probabilities are bounded away from zero and one, conditional on covariates X.
If these assumptions are met, it is possible to identify the average treatment effect via matching, inverse probability weighting, regression adjustment, or doubly robust methods (Hernan and Robins, 2020; Imbens and Wooldridge, 2009). However, if the overlap is limited, estimating the average treatment effect can be unreliable.

To deal with the issue of limited overlap, researchers sometimes switch to different estimands such as the average effect on the treated (ATT) or the overlap-weighted effect (Crump et al., 2006). In the following, we will focus on the overlap-weighted effect as it is the causal contrast that can be estimated with the lowest asymptotic variance in certain scenarios (Crump et al., 2006). The overlap-weighted effect is defined as

    θ_1 = E[p(T = 1|X)(1 − p(T = 1|X)) τ(X)] / E[p(T = 1|X)(1 − p(T = 1|X))],

where τ(x) = E[Y(1) − Y(0)|X = x]. Note that if the treatment effect is homogeneous, τ(x) ≡ const., then the overlap-weighted effect and the average treatment effect coincide, that means θ_1 = θ_0. Thus, shrinkage towards an efficient estimator of the overlap effect is potentially beneficial under treatment effect homogeneity.

We investigate shrinking between estimators of the average treatment effect and the overlap-weighted effect in a data-driven way. The proposed model selection tool will be used to trade off bias and variance.

We observe 1000 independent and identically distributed draws (Y_i(T_i), T_i, X_i) of a distribution P, where the X_i are covariates. The data generating process was chosen such that there is limited overlap, i.e. P[T = 1|X = 0] ≈ 0.05, and treatment assignment is unconfounded, (Y(0), Y(1)) ⊥ T | X (Rosenbaum and Rubin, 1983). As discussed above, the causal effect can be estimated via doubly robust methods such as augmented inverse probability weighting, among others (Hernan and Robins, 2020). The data is generated according to the following equations:

    ε_Y ∼ N(0, 1),  X ∼ Ber(0.5),
    T ∼ Ber(0.7) if X = 1,  T ∼ Ber(0.05) if X = 0,
    Y(t) = X + t + 3ts·X + ε_Y,    (8)

where s ∈ [0, 1]. For s = 0, the treatment effect is homogeneous, as τ(x) ≡ 1. Thus, for s = 0, the overlap-weighted effect coincides with the average treatment effect.

We can estimate the average treatment effect via augmented inverse probability weighting (Robins et al., 1994), θ̂_0 = μ̂_1 − μ̂_0, where

    μ̂_a = (1/n) Σ_{i=1}^n [ Y_i·1{T_i = a} / p̂(T_i|X_i) − ((1{T_i = a} − p̂(T_i|X_i)) / p̂(T_i|X_i))·Q̂(X_i, a) ],

and where Q̂(x, t) is the empirical mean of Y given X = x and T = t, and p̂(·|·) are empirical probabilities. Similarly as above, we can estimate the overlap effect by augmented inverse probability weighting,

    θ̂_1 = (η̂_1 − η̂_0) / ( (1/n) Σ_i p̂(T_i = 1|X_i)(1 − p̂(T_i = 1|X_i)) ),

where

    η̂_a = (1/n) Σ_{i=1}^n [ Y_i·1{T_i = a}·(1 − p̂(T_i|X_i)) − (1{T_i = a} − p̂(T_i|X_i))·(1 − p̂(T_i|X_i))·Q̂(X_i, a) ].

For w ∈ {0, 1/10, . . . , 1} we define

    θ̂_w = (1 − w)·θ̂_1 + w·θ̂_0.    (9)

For s ≈ 0, due to treatment effect homogeneity we expect E[(θ̂_1 − θ_0)²] < E[(θ̂_0 − θ_0)²]. For s ≈ 1, we expect E[(θ̂_1 − θ_0)²] > E[(θ̂_0 − θ_0)²]. In the first case, the optimal estimator is θ̂_w with w ≈ 0. In the second case, the optimal estimator is θ̂_w with w ≈ 1.

Figure 1: Mean-squared error R(ŵ) where ŵ is selected via the cross-validation criterion or targeted selection. The data is drawn according to equation (8). Cross-validation and targeted selection are used to shrink between the AIPW ATE estimator and the AIPW overlap estimator. The proposed method performs equal or better than cross-validation across most s ∈ [0, 1].

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 1. Results are averaged across 200 simulation runs. For s < 0.3, targeted selection outperforms the baseline estimator θ̂_0. For 0.3 < s < 0.8, targeted selection performs worse than θ̂_0. Over almost the entire range s ∈ [0, 1], targeted selection performs equal to or better than cross-validation.

The instrumental variables approach is a widely-used method to estimate the causal effect of a treatment T on a target outcome Y in the presence of confounding (Wright, 1928; Bowden and Turkington, 1990; Angrist et al., 1996). Roughly speaking, the method relies on a predictor I (called the instrument) of the treatment T that is not associated with the error term of the outcome Y. We will not discuss the assumptions behind instrumental variables in detail, but refer the interested reader to Hernan and Robins (2020). We will focus on the case where I, T and Y are one-dimensional. Under IV assumptions and linearity, the target quantity can be re-written as

    θ_0 = E[Y(1) − Y(0)] = Cov(I, Y) / Cov(I, T).

Estimating this quantity can be challenging if the instrument is weak, i.e. if Cov(I, T) ≈ 0. In this case, the approach can benefit from shrinkage towards the ordinary least-squares solution (Nagar, 1959; Theil, 1961; Rothenhäusler et al., 2020; Jakobsen and Peters, 2020). Doing so may decrease the variance but generally introduces bias. We will focus on the case where we have some additional observational data, where we observe T and Y, but where the instrument I is unobserved.

We draw 500 i.i.d. observations according to the following equations:
    I, H, ε_T, ε_Y ∼ N(0, 1),
    T = I + H + ε_T,
    Y(t) = t − s·H + ε_Y,    (10)

We vary s ∈ [0, 2], which controls the strength of the confounding between T and Y. We observe (T_i, Y_i(T_i), I_i) for i = 1, . . . , 500. In addition, we have observational samples i = 501, . . . , where we observe T and Y, but not the instrument I. Formally, for i = 501, . . . , we observe (T_i, Y_i(T_i)) drawn according to equation (10).

In the linear case, the instrumental variables approach can be written as

    b̂_IV = Ĉov(I, Y) / Ĉov(I, T),

where Ĉov denotes the empirical covariance over the observations i = 1, . . . , 500. The ordinary least-squares estimator is

    b̂_OLS = argmin_b min_c Ê[(Y − Tb − c)²],

where Ê denotes the empirical expectation over the observations i = 1, . . . .

Figure 2: Mean-squared error R(ŵ) where ŵ is selected via the cross-validation criterion or targeted selection. The data is drawn according to equation (10). Cross-validation and targeted selection are used to stabilize the instrumental variables approach by shrinking the estimate towards ordinary least-squares. The proposed method performs equal or better than cross-validation across all s ∈ [0, 2].

Shrinking towards OLS introduces bias for s ≠ 0, but potentially decreases variance. As candidate estimators, for any w ∈ {0/10, 1/10, . . . , 10/10} we consider convex combinations of OLS and IV,

    θ̂_w = w·b̂_OLS + (1 − w)·b̂_IV.

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 2. Results are averaged across 200 simulation runs. Over the entire range s ∈ [0, 2], targeted selection performs equal to or better than cross-validation. For s ≈ 0, targeted selection outperforms the IV approach. For s > 0.5, the proposed approach performs worse than the IV approach. For s ≈ 2, the proposed approach performs similar to the IV approach. We evaluated the realized coverage of confidence intervals with nominal coverage 95% as described in Section 3.5 across s ∈ [0, 2].

4.3 Experiment with proxy outcome

One of the most popular estimators for causal effects in experimental settings is difference-in-means. To improve variance, it is possible to adjust for pre-treatment covariates, see for example Lin (2013); Deng et al. (2013). This raises the question whether post-treatment covariates can be used to improve the precision of causal effect estimates. This is indeed the case under additional assumptions. For example, in some cases, the treatment effect can be written as the product

    θ_0 = E[Y|T = 1] − E[Y|T = 0] = θ_{T→P} · θ_{P→Y},    (11)

where θ_{T→P} = E[P|T = 1] − E[P|T = 0] is the effect of the treatment on some surrogate or proxy outcome P ∈ {0, 1}; and θ_{P→Y} = E[Y|P = 1] − E[Y|P = 0] is the effect of the proxy on the outcome. It is well-known that estimators that make use of such decompositions can outperform the standard difference-in-means estimator in terms of asymptotic variance (Tsiatis, 2007; Athey et al., 2019; Guo and Perković, 2020). However, doing so can introduce bias if equation (11) does not hold. We will use the proposed model selection procedure to shrink between difference-in-means and an estimator that is unbiased if the treatment effect decomposition in equation (11) holds.

We consider a simple experimental setting with a post-treatment variable P. For simplicity, let us consider an experiment with binary treatment T ∈ {0, 1}, a binary proxy outcome P ∈ {0, 1} and outcome Y. We draw 200 i.i.d. observations according to the following equations:

    T ∼ Ber(0.5),
    ε_P, ε_Y ∼ N(0, 1),
    P(t) = 1{ε_P ≤ t},
    Y(t) = (1/2)·P(t) + s·t + ε_Y.    (12)

For s = 0, the outcome Y(T) is conditionally independent of the treatment, given the proxy P(T).
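For concreteness, the data generating process in equation (12) might be simulated as below. The treatment probability Ber(0.5), the unit-variance noise, and the 1/2 coefficient on the proxy are our reading of equation (12) and should be treated as illustrative assumptions.

```python
import numpy as np

def draw_proxy_data(n, s, seed=0):
    """Simulate the proxy-outcome experiment of equation (12).
    Treatment probability and noise scales are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    T = rng.binomial(1, 0.5, n)           # assumed Ber(0.5) assignment
    eps_P = rng.normal(size=n)
    eps_Y = rng.normal(size=n)
    P = (eps_P <= T).astype(float)        # P(t) = 1{eps_P <= t}
    Y = 0.5 * P + s * T + eps_Y           # Y(t) = (1/2) P(t) + s t + eps_Y
    return T, P, Y
```

For s = 0 the only path from T to Y runs through P, which is the surrogate condition Y ⊥ T | P underlying the product estimator below.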
In this case, the average treatment effect can be written in product form, $\theta = \theta_{T \to P} \cdot \theta_{P \to Y}$, and this decomposition can be leveraged for estimation. For $s \neq 0$, this decomposition does not hold.

The standard estimator of causal effects from experiments is difference-in-means,
\[
\hat\theta_1 = \frac{1}{\sum_i T_i} \sum_{i : T_i = 1} Y_i - \frac{1}{\sum_i (1 - T_i)} \sum_{i : T_i = 0} Y_i.
\]
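The simulation design in equation (12) and the difference-in-means estimator can be sketched in a few lines of code (a minimal sketch; the helper names `simulate` and `difference_in_means` are not from the paper, and the seed is an arbitrary choice):

```python
import numpy as np

def simulate(s, n=200, rng=None):
    """Draw n i.i.d. observations from the design in equation (12)."""
    rng = np.random.default_rng(rng)
    T = rng.binomial(1, 0.5, size=n)      # T ~ Ber(0.5)
    eps_P = rng.normal(size=n)            # noise driving the proxy
    eps_Y = rng.normal(size=n)            # noise driving the outcome
    P = (eps_P <= T).astype(float)        # P(t) = 1{eps_P <= t}
    Y = 0.5 * P + s * T + eps_Y           # Y(t) = P(t)/2 + s*t + eps_Y
    return T, P, Y

def difference_in_means(T, Y):
    """Standard difference-in-means estimator of the ATE."""
    return Y[T == 1].mean() - Y[T == 0].mean()

T, P, Y = simulate(s=0.0, rng=0)
print(difference_in_means(T, Y))
```

For $s = 0$ the true effect is $\theta = \tfrac{1}{2}(\Phi(1) - \Phi(0))$, which the estimator recovers up to sampling noise.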
If the proxy outcome is a valid surrogate, i.e. if $Y \perp T \mid P$, we can rewrite $\theta$ as
\begin{align*}
\theta &= E[Y \mid T=1] - E[Y \mid T=0] \\
&= E\big[ E[Y \mid P=1] P + E[Y \mid P=0](1-P) \mid T=1 \big] - E\big[ E[Y \mid P=1] P + E[Y \mid P=0](1-P) \mid T=0 \big] \\
&= \big( E[Y \mid P=1] - E[Y \mid P=0] \big) \big( E[P \mid T=1] - E[P \mid T=0] \big) \\
&= \theta_{T \to P} \cdot \theta_{P \to Y}.
\end{align*}
Thus, in this case, we can also consider the product estimator
\[
\hat\theta_2 = \Big( \frac{1}{\sum_i T_i} \sum_{i : T_i = 1} P_i - \frac{1}{\sum_i (1 - T_i)} \sum_{i : T_i = 0} P_i \Big) \cdot \Big( \frac{1}{\sum_i P_i} \sum_{i : P_i = 1} Y_i - \frac{1}{\sum_i (1 - P_i)} \sum_{i : P_i = 0} Y_i \Big).
\]
We shrink between these two estimators, i.e. for $w$ in an equidistant grid $\{0, \ldots, 1\}$ we define $\hat\theta_w = (1 - w) \hat\theta_1 + w \hat\theta_2$.

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 3. Results are averaged across 200 simulation runs. Similarly as above, targeted selection performs similarly to or better than cross-validation. Targeted selection performs better than the baseline model for $s < 0.4$. The proposed procedure performs worse than the baseline procedure in the regime $s \in [0.4, 1]$. For $s \ge 1$, targeted selection approaches the performance of difference-in-means. We evaluated the realized coverage of confidence intervals with nominal coverage 95% as described in Section 3.5, across $s \in [0, 1.2]$.

Figure 3: Mean-squared error $R(\hat w)$, where $\hat w$ is selected via the cross-validation criterion or via targeted selection. The data is drawn according to equation (12). Cross-validation and targeted selection are used to stabilize the difference-in-means estimator by shrinking towards an estimator that makes use of a proxy outcome. The proposed method performs similarly to or better than cross-validation across $s \in [0, 1.2]$. (Curves: targeted selection, cross-validation, difference-in-means, proxy model.)

5 Discussion

We have introduced a method that allows one to conduct targeted parameter selection by estimating the bias and variance of candidate estimators. The theoretical justification of the method relies on a linear asymptotic expansion of the estimator. The method is very general and can be used in both parametric and semi-parametric settings. Under regularity conditions, we showed that, for $n \to \infty$, the modified risk criterion selects models with lower or equal risk than the baseline estimator $\hat\theta$. Furthermore, we discussed a computational shortcut to obtain bootstrap confidence intervals.

In simulations, we showed that the method selects reasonable models and outperforms the cross-validation procedure in most scenarios. The proposed method can decrease variance if the competing estimators are approximately unbiased. However, there is no free lunch. In transitional regimes, for fixed $n$, the estimator can perform worse than the baseline estimator $\hat\theta$.
This is to be expected from statistical theory; see for example the discussion of the Hodges-Le Cam estimator in Van der Vaart (2000).

The theoretical justification of the proposed method relies on a linear approximation of the estimator in a neighborhood of the parameter values $\theta_g$. Thus, it would be important to understand the performance of the method in scenarios where parameter estimates of some of the estimators are far from the parameter values. To quantify uncertainty, we develop a modified bootstrap procedure. If the procedure is tasked with selecting among a large number of estimators, a selective inference approach might be more appropriate (Taylor and Tibshirani, 2015). In Section 4.2, we have seen some preliminary evidence that the proposed methodology may be used to combine knowledge across data sets. The proposed method is not tailored to this special case. Thus, we believe that it would be exciting to investigate whether the model selection can be further improved for data fusion tasks.

Acknowledgements
The author would like to thank Guillaume Basse, Peter Bühlmann, Guido Imbens, Nicolai Meinshausen, Jonas Peters, and Bin Yu for inspiring discussions.
Appendix
This appendix contains proofs for the theoretical results in the main paper.
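The proofs below manipulate a risk criterion of the form $\hat R(g) = (\hat\theta_g - \hat\theta)^2 - \widehat{\mathrm{Var}}(\hat\theta_g - \hat\theta) + \widehat{\mathrm{Var}}(\hat\theta_g)$. As a numerical sanity check of the expansion used in the first proof, one can simulate a toy model in which the candidate and baseline estimators are exact averages of observed influence functions (a sketch only; the constants and variable names are arbitrary, and the form of $\hat R$ above is an assumption read off from the expansion below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 5000
rho, sd_g = 0.8, 1.5          # correlation and scale of psi_g relative to psi

# Toy model with theta = 0: hat_theta_g = mean(psi_g), hat_theta = mean(psi).
r_hat = np.empty(B)
for b in range(B):
    psi = rng.normal(size=n)
    psi_g = sd_g * (rho * psi + np.sqrt(1 - rho**2) * rng.normal(size=n))
    diff = psi_g - psi
    # hat_R(g) = (hat_theta_g - hat_theta)^2 - Var_hat(diff)/n + Var_hat(psi_g)/n
    r_hat[b] = diff.mean() ** 2 - diff.var(ddof=1) / n + psi_g.var(ddof=1) / n

mse_true = sd_g**2 / n        # true MSE of hat_theta_g, since theta = 0
print(r_hat.mean(), mse_true)
```

Averaged over simulation runs, $\hat R(g)$ matches the true mean-squared error $\mathrm{Var}(\psi_g)/n$ up to Monte Carlo error, in line with Case 1 of the first proof.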
Proof.
Case 1: $\theta_g = \theta$.

First, note that as $E[e_g(n)] = o(1/n)$ and as $E[\hat\theta_g] - \theta_g = o(1/\sqrt{n})$,
\[
E[(\hat\theta_g - \theta)^2] = \frac{1}{n} \mathrm{Var}(\psi_g) + o\Big(\frac{1}{n}\Big).
\]
By assumption,
\[
n \widehat{\mathrm{Var}}(\hat\theta_g - \hat\theta) - \mathrm{Var}(\psi_g - \psi) = o_P(1).
\]
Similarly,
\[
n \widehat{\mathrm{Var}}(\hat\theta_g) - \mathrm{Var}(\psi_g) = o_P(1).
\]
Combining these equations with the definition of $\hat R(g)$,
\[
\hat R(g) = \Big( \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \frac{1}{n} \mathrm{Var}(\psi_g - \psi) + \frac{1}{n} \mathrm{Var}(\psi_g) + o_P(1/n).
\]
Using the central limit theorem, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a Gaussian random variable with mean zero and variance $\mathrm{Var}(\psi_g - \psi)$. Thus,
\[
n \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = n \Big( \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1)
= \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1)
\]
converges to a random variable with mean zero. This concludes the proof of the case $\theta_g = \theta$.

Case 2: $\theta_g \neq \theta$.

Similarly as above, we can show that
\[
\hat R(g) = (\theta_g - \theta)^2 + 2 (\theta_g - \theta) \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1/\sqrt{n}),
\]
and
\[
E[(\hat\theta_g - \theta)^2] = (\theta_g - \theta)^2 + o(1/\sqrt{n}).
\]
Thus,
\[
\sqrt{n} \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = 2 (\theta_g - \theta) \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1). \quad (13)
\]
Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a centered Gaussian random variable with variance $\mathrm{Var}(\psi_g(D) - \psi(D))$. Using this fact in equation (13) concludes the proof.

Proof.
Case 1: $\theta_g = \theta$.

Let $S_k = \{ i : D_i \in D_{1,k} \}$. Analogously as in the proof of Theorem 1, it follows that
\[
n \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = \frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g) + o_P(1).
\]
Recall that $|S_k| \sim n (K-1)/K$. Using the CLT, for every $k = 1, \ldots, K$,
\[
\sqrt{n} \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)
\]
converges to a centered Gaussian random variable with variance
\[
\frac{1}{1 - \alpha} \mathrm{Var}(\psi_g(D)) + \frac{1}{\alpha} \mathrm{Var}(\psi(D)),
\]
where $\alpha = 1/K$. Hence,
\[
\frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g)
\]
converges to a random variable with asymptotic mean
\[
\frac{\alpha}{1 - \alpha} \mathrm{Var}(\psi_g(D)) + \frac{1}{\alpha} \mathrm{Var}(\psi(D)).
\]
As $\mathrm{Var}(\psi_g(D)) \neq \mathrm{Var}(\psi(D))$, the asymptotic mean of $n ( \tilde R(g) - E[(\hat\theta_g - \theta)^2])$ is a nonzero quantity that depends on $g$. This concludes the proof of case 1.

Case 2: $\theta_g \neq \theta$.

Similarly as above, we can show that
\[
\tilde R(g) = (\theta_g - \theta)^2 + 2 (\theta_g - \theta) \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1/\sqrt{n}).
\]
Thus,
\[
\sqrt{n} \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = 2 (\theta_g - \theta) \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1). \quad (14)
\]
Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a centered Gaussian random variable with variance $\mathrm{Var}(\psi_g(D) - \psi(D))$. Using this fact in equation (14) concludes the proof.

Proof.
Case 1: $\theta_g = \theta$.

Let $S_k = \{ i : D_i \in D_{1,k} \}$. Inspecting the proof of Theorem 1, we obtain that
\[
n \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Multiplying with $K = 1/\alpha$, we can rewrite this as
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \frac{\sqrt{K}}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Writing $Z_k = \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi_g(D_i)$ and $X_k = \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi(D_i)$ and using $\alpha = 1/K$, we obtain
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \sum_{k=1}^K (Z_k - X_k) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Similarly, by inspecting the proof of Theorem 2,
\[
n \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = \frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g) + o_P(1).
\]
We have $n - |S_k| \sim \alpha n$ and $|S_k| \sim (K-1) \alpha n$. Thus, multiplying with $K = 1/\alpha$, we can rewrite this as
\[
n K \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = K \sum_k \Big( \frac{1}{K-1} \frac{1}{\sqrt{\alpha n}} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1)
= K \sum_k \Big( \frac{1}{K-1} \sum_{j \neq k} Z_j - X_k \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1).
\]
Thus, to complete the proof, we have to show that the asymptotic variance of the first term is larger than the asymptotic variance of the second term:
\[
n K \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = K \sum_k \Big( \frac{1}{K-1} \sum_{j \neq k} Z_j - X_k \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1),
\]
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \sum_{k=1}^K (Z_k - X_k) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Using the CLT, for $n \to \infty$ and $K$ fixed, $(X_1, \ldots, X_K, Z_1, \ldots, Z_K)$ converges to a centered multivariate Gaussian vector with non-degenerate covariance. Hence, without loss of generality, in the following we can assume that the vector $(X_1, \ldots, X_K, Z_1, \ldots, Z_K)$ is multivariate Gaussian. Recall that we assume $|\mathrm{Cor}(\psi_g, \psi)| \neq 1$. Thus, we also have $|\mathrm{Cor}(Z_j, X_j)| \neq 1$. After dividing both leading terms by $K^2$, the comparison is exactly the inequality of Lemma 1, which completes the proof of case 1.

Case 2: $\theta_g \neq \theta$.

Inspecting the proofs of Theorem 1 and Theorem 2, we obtain that in the case $\theta_g \neq \theta$, the asymptotic variance of $\sqrt{n} \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big)$ is
\[
4 (\theta_g - \theta)^2 \mathrm{Var}(\psi_g(D) - \psi(D)), \quad (15)
\]
which is the same as the asymptotic variance of $\sqrt{n} \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big)$. This concludes the proof.

Lemma 1.
Let $(Z_i, X_i)$, $i = 1, \ldots, K$, be i.i.d. Gaussian with mean zero and nonzero variance and $|\mathrm{Cor}(Z_i, X_i)| \neq 1$. Let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - X_i) \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - X_i \Big)^2 \Big). \quad (16)
\]

Proof.
As $(Z_i, X_i)$ are multivariate Gaussian and $|\mathrm{Cor}(Z_i, X_i)| \neq 1$, we can write $X_i = \alpha Z_i + \epsilon_i$ for some $\alpha \in \mathbb{R}$, where $(\epsilon_i)_i$ is centered Gaussian with nonzero variance and independent of $(Z_i)_i$. Thus, it suffices to show that
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i - \epsilon_i) \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i - \epsilon_i \Big)^2 \Big). \quad (17)
\]
Expanding the square, the left-hand side of equation (17) equals
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 - 2 \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) + \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big)
\]
\[
= \mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) + 4 \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) + \mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big).
\]
Similarly, the right-hand side of equation (17) equals
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big) + 4 \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big) + \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big).
\]
Now,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big),
\]
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big),
\]
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big).
\]
The first two inequalities follow from Lemma 3 and Lemma 2, respectively.
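The two expansions above use that the remaining cross-covariances vanish (each such term contains an odd number of independent centered Gaussian factors) and that the variance of a product of independent centered random variables factorizes:
\[
\mathrm{Var}(UV) = E[U^2 V^2] - E[UV]^2 = E[U^2] E[V^2] = \mathrm{Var}(U) \mathrm{Var}(V)
\]
for independent $U, V$ with $E[U] = E[V] = 0$.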
Let us now deal with the last term. We have
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big) - \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)
\]
\[
= \sum_{i \neq j} \mathrm{Var}\Big( \frac{1}{K(K-1)} \epsilon_i Z_j \Big) + \sum_i \mathrm{Var}\Big( \frac{\alpha}{K} \epsilon_i Z_i \Big) - \frac{(1-\alpha)^2}{K^2} \mathrm{Var}(Z_1) \mathrm{Var}(\epsilon_1)
\]
\[
= \mathrm{Var}(\epsilon_1) \mathrm{Var}(Z_1) \Big( \frac{1}{K(K-1)} + \frac{\alpha^2}{K} - \frac{(1-\alpha)^2}{K^2} \Big).
\]
Thus, it suffices to show that
\[
\frac{1}{K(K-1)} - \frac{1}{K^2} + \frac{\alpha^2}{K} - \frac{\alpha^2}{K^2} + \frac{2\alpha}{K^2} \ge 0,
\]
that is,
\[
\frac{1}{K^2(K-1)} + \frac{\alpha^2 (K-1)}{K^2} + \frac{2\alpha}{K^2} \ge 0. \quad (18)
\]
Dividing by $(K-1)/K^2$ yields
\[
\frac{1}{(K-1)^2} + \alpha^2 + \frac{2\alpha}{K-1} \ge 0.
\]
The left-hand side can be written as
\[
\Big( \frac{1}{K-1} + \alpha \Big)^2.
\]
This shows the inequality in equation (18), which completes the proof.

Lemma 2.
Let $\epsilon_i$, $i = 1, \ldots, K$, be i.i.d. centered Gaussian random variables with nonzero variance, and let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big). \quad (19)
\]

Proof.
On the left-hand side of equation (19) we have $\frac{2}{K^2} \mathrm{Var}(\epsilon_1)^2$. On the right-hand side of equation (19) we have $\frac{2}{K} \mathrm{Var}(\epsilon_1)^2$. As $K \ge 2$ and $\mathrm{Var}(\epsilon_1) > 0$, this completes the proof.
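Both computations rest on the Gaussian fourth-moment identity: if $\xi \sim N(0, v)$, then
\[
\mathrm{Var}(\xi^2) = E[\xi^4] - E[\xi^2]^2 = 3 v^2 - v^2 = 2 v^2.
\]
Applied with $\xi = \epsilon_1$ and with $\xi = \frac{1}{K} \sum_i \epsilon_i \sim N(0, \mathrm{Var}(\epsilon_1)/K)$, this yields the two sides of equation (19).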
Lemma 3.
Let $Z_i$, $i = 1, \ldots, K$, be i.i.d. centered Gaussian random variables, let $\alpha \in \mathbb{R}$, and let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big).
\]

Proof.
It suffices to show that
\[
\mathrm{Var}\Big( \Big( \frac{1}{\sqrt{K}} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big), \quad (20)
\]
which is the claimed inequality with both sides multiplied by $K^2$. Since $\frac{1}{\sqrt{K}} \sum_{i=1}^K (1-\alpha) Z_i \sim N(0, (1-\alpha)^2 \mathrm{Var}(Z_1))$, on the left-hand side of equation (20) we have
\[
2 (1 - \alpha)^4 \mathrm{Var}(Z_1)^2.
\]
Expanding the squares and collecting terms, on the right-hand side of equation (20) we have
\[
\mathrm{Var}\Big( \Big( \frac{1}{K-1} + \alpha^2 \Big) \sum_{i=1}^K Z_i^2 + \Big( \frac{2(K-2)}{(K-1)^2} - \frac{4\alpha}{K-1} \Big) \sum_{i > j} Z_i Z_j \Big)
\]
\[
= 2 K \Big( \frac{1}{K-1} + \alpha^2 \Big)^2 \mathrm{Var}(Z_1)^2 + \frac{K(K-1)}{2} \Big( \frac{2(K-2)}{(K-1)^2} - \frac{4\alpha}{K-1} \Big)^2 \mathrm{Var}(Z_1)^2,
\]
using that the $Z_i^2$ and the products $Z_i Z_j$, $i > j$, are mutually uncorrelated with $\mathrm{Var}(Z_i^2) = 2 \mathrm{Var}(Z_1)^2$ and $\mathrm{Var}(Z_i Z_j) = \mathrm{Var}(Z_1)^2$. Combining the two computations above and dividing by $2 \mathrm{Var}(Z_1)^2$, it suffices to show that for all $\alpha$ and all $K \ge 2$,
\[
K \alpha^4 + \frac{6 K \alpha^2}{K-1} - \frac{4 \alpha K (K-2)}{(K-1)^2} + \frac{K}{(K-1)^2} + \frac{K(K-2)^2}{(K-1)^3} \ge \alpha^4 - 4\alpha^3 + 6\alpha^2 - 4\alpha + 1.
\]
Rearranging, and using the identities $(K-1)^2 - K(K-2) = 1$ and $K(K-1) + K(K-2)^2 - (K-1)^3 = 1$, it suffices to show that
\[
(K-1)\alpha^4 + 4\alpha^3 + \frac{6\alpha^2}{K-1} + \frac{4\alpha}{(K-1)^2} + \frac{1}{(K-1)^3} \ge 0.
\]
Multiplying with $(K-1)^2$, this is equivalent to
\[
(K-1)^3 \alpha^4 + 4 \alpha^3 (K-1)^2 + 6 \alpha^2 (K-1) + 4 \alpha + \frac{1}{K-1} \ge 0. \quad (21)
\]
Rearranging the left-hand side, we obtain
\[
(K-1)^3 \Big( \alpha + \frac{1}{K-1} \Big)^4 \ge 0.
\]
This proves the inequality in equation (21), which completes the proof.
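The inequality of Lemma 1 can also be cross-checked by simulation (a sketch; the choices of $K$, the correlation, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, rho = 200_000, 5, 0.5          # replications, folds, Cor(Z_i, X_i)

# B independent copies of (Z_1,...,Z_K, X_1,...,X_K) with X_i = rho*Z_i + noise.
Z = rng.normal(size=(B, K))
X = rho * Z + np.sqrt(1 - rho**2) * rng.normal(size=(B, K))

# Left-hand side of (16): variance of the squared average of Z_i - X_i.
lhs = np.var((Z - X).mean(axis=1) ** 2)

# Right-hand side of (16): variance of the average of squared
# leave-one-out contrasts (1/(K-1)) * sum_{j != i} Z_j - X_i.
loo = (Z.sum(axis=1, keepdims=True) - Z) / (K - 1)
rhs = np.var(((loo - X) ** 2).mean(axis=1))

print(lhs, rhs)
```

In accordance with the lemma, the left-hand side should come out strictly smaller than the right-hand side.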
Proof.
By assumption, we have $\hat\theta_g - \hat\theta = \theta_g - \theta + O_P(1/\sqrt{n})$. If $\theta_g - \theta \neq 0$,
\[
\hat R_{\mathrm{mod}}(g) = (\theta_g - \theta)^2 + O_P(1/\sqrt{n}).
\]
On the other hand, if $\theta_g = \theta$,
\[
\hat R_{\mathrm{mod}}(g) = O_P(1/n).
\]
Thus,
\[
P[\theta_{\hat g} = \theta] \to 1.
\]
Now consider any $g$ with $\theta_g = \theta$ and $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$. Then,
\[
\hat R_{\mathrm{mod}}(g) \ge \mathrm{Var}(\hat\theta_g) + o_P(1/n),
\]
and
\[
\hat R_{\mathrm{mod}}(0) = \mathrm{Var}(\hat\theta) + o_P(1/n).
\]
By assumption, $\mathrm{Var}(\hat\theta_g) = \frac{1}{n} \mathrm{Var}(\psi_g) + o(1/n)$ and $\mathrm{Var}(\hat\theta) = \frac{1}{n} \mathrm{Var}(\psi) + o(1/n)$. Furthermore, by assumption $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$. Thus, for $n \to \infty$, $\lim_n n \hat R_{\mathrm{mod}}(0) < \lim_n n \hat R_{\mathrm{mod}}(g)$. As this holds for all $g$ with $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$, this concludes the proof.

References

Akaike, H. (1974). A new look at the statistical model identification. In
Selected Papers of Hirotugu Akaike, pages 215–222. Springer.

Angrist, J., Imbens, G., and Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.

Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys.

Athey, S., Chetty, R., and Imbens, G. (2020). Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676.

Athey, S., Chetty, R., Imbens, G. W., and Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences.

Bareinboim, E. and Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.

Bowden, R. and Turkington, D. (1990). Instrumental Variables, volume 8. Cambridge University Press.

Brookhart, M. and Van der Laan, M. (2006). A semiparametric model selection criterion with applications to the marginal structural model. Computational Statistics & Data Analysis, 50(2):475–498.

Claeskens, G. and Hjort, N. (2003). The focused information criterion. Journal of the American Statistical Association.

Claeskens, G. and Hjort, N. (2008). Model Selection and Model Averaging. Cambridge University Press.

Crump, R., Hotz, V., Imbens, G., and Mitnik, O. (2006). Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical report, National Bureau of Economic Research.

Cui, Y. and Tchetgen-Tchetgen, E. (2019). Bias-aware model selection for machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029.

Deng, A., Xu, Y., Kohavi, R., and Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 123–132.

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer.

Guo, F. and Perković, E. (2020). Efficient least squares for estimating total effects under linearity and causal sufficiency. arXiv preprint arXiv:2008.03481.

Heckman, J., Ichimura, H., and Todd, P. (1998). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65(2):261–294.

Hernán, M. and Robins, J. (2020). Causal Inference: What If. Chapman & Hall.

Imbens, G. and Wooldridge, J. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86.

Jakobsen, M. and Peters, J. (2020). Distributional robustness of k-class estimators and the pulse. arXiv preprint arXiv:2005.03353.

Kapelner, A., Bleich, J., Levine, A., Cohen, Z. D., DeRubeis, R., and Berk, R. (2014). Inference for the effectiveness of personalized medicine with software. arXiv preprint arXiv:1404.7844.

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1):295–318.

Nagar, A. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica, pages 575–595.

Nie, X. and Wager, S. (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.

Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 78(5).

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N., Hastie, T., and Tibshirani, R. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine.

Robins, J., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association.

Rolling, C. and Yang, Y. (2014). Model selection for estimating treatment effects. Journal of the Royal Statistical Society: Series B.

Rosenbaum, P. and Rubin, D. (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B, 45(2):212–218.

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., and Peters, J. (2020). Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229.

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Schuler, A., Baiocchi, M., Tibshirani, R., and Shah, N. (2018). A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2).

Splawa-Neyman, J., Dabrowska, D., and Speed, T. (1990). On the application of probability theory to agricultural experiments. Statistical Science, pages 465–472.

Taylor, J. and Tibshirani, R. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634.

Theil, H. (1961). Economic Forecasts and Policy. North-Holland Pub. Co.

Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.

Van der Laan, M. and Robins, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.

Van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

Wright, P. (1928). Tariff on Animal and Vegetable Oils. Macmillan Company, New York.

Zhao, Y., Fang, X., and Simchi-Levi, D. (2017). Uplift modeling with multiple treatments and general response types. In