Model selection for estimation of causal parameters

Dominik Rothenhäusler, Stanford University

September 1, 2020
Abstract
A popular technique for selecting and tuning machine learning estimators is cross-validation. Cross-validation evaluates overall model fit, usually in terms of predictive accuracy. This may lead to models that exhibit good overall predictive accuracy, but can be suboptimal for estimating causal quantities such as the average treatment effect. We propose a model selection procedure that estimates the mean-squared error of a one-dimensional estimator. The procedure relies on knowing an asymptotically unbiased estimator of the parameter of interest. Under regularity conditions, we show that the proposed criterion has asymptotically equal or lower variance than competing procedures based on sample splitting. In the literature, model selection is often used to choose among models for nuisance parameters, but the identification strategy is usually fixed across models. Here, we use model selection to select among estimators that correspond to different estimands. More specifically, we use model selection to shrink between methods such as augmented inverse probability weighting, regression adjustment, the instrumental variables approach, and difference-in-means. The performance of the approach for estimation and inference for average treatment effects is evaluated on simulated data sets, including experimental data, instrumental variables settings and observational data with selection on observables.
Model selection is a fundamental task in statistical practice. Usually, the aim is to find a model that optimizes overall model fit. If the loss function is quadratic, the total error can be decomposed into the error due to variance and the error due to bias. A popular technique to balance the bias-variance trade-off in the context of prediction is cross-validation. However, models that have high predictive accuracy do not always yield accurate parameter estimates. In the literature, there exist some solutions for parameter-specific model selection, but we currently lack a reliable general-purpose tool. In the context of causal inference, a reliable parameter-specific model selection tool could enable the following applications.

Estimating average treatment effects from observational data can be unreliable if conditional treatment assignment probabilities are close to zero or one. To address this issue, some applied researchers "move the goalpost" by switching to causal contrasts that can be estimated more reliably; see LaLonde (1986), Heckman et al. (1998), Crump et al. (2006) and references therein. For example, instead of estimating the average treatment effect (ATE), a practitioner might switch to estimating the average treatment effect on the treated (ATT), or other causal contrasts such as the overlap effect. This is often done in an ad-hoc fashion. Using a model selection tool in this context would allow practitioners to trade off bias and variance when switching from estimating the ATE to estimators of other causal contrasts.
Combining evidence across data sets in the context of causal inference has recently attracted increasing interest (Peters et al., 2016; Bareinboim and Pearl, 2016; Athey et al., 2020). However, data quality often varies from data set to data set. In this case, using all data sets can lead to untrustworthy, biased estimators. A reliable model selection tool could make it possible to distinguish which data sets and estimators are useful for solving the estimation problem at hand, and which are not.
Researchers are often uncomfortable with using statistical procedures that only work under strong assumptions. Using such methods over other procedures may introduce some bias if the assumptions are violated, but has the potential to reduce variance. For example, conditional independence assumptions can be leveraged to improve precision of treatment effect estimators (Athey et al., 2019; Guo and Perković, 2020). A model selection tool as described above would allow researchers to systematically trade off bias and variance when switching between estimation procedures that work under different sets of assumptions. Doing so can potentially improve precision in scenarios where researchers are not comfortable with making strong assumptions.
Model selection has a long history in statistics and machine learning. For optimizing loss-based estimators, the most commonly used methods include cross-validation, the Akaike information criterion, and the Bayesian information criterion (Akaike, 1974; Schwarz, 1978; Friedman et al., 2001; Arlot and Celisse, 2010).

The focused information criterion is a model selection criterion which, for a given focus parameter, estimates the mean-squared error of submodels (Claeskens and Hjort, 2003, 2008). It relies on knowing an asymptotically unbiased estimator of the parameter of interest. Its theoretical justification is given in a local misspecification framework.

More recently, Cui and Tchetgen-Tchetgen (2019) introduced a model selection tool for finite-dimensional functionals in a semiparametric model when a doubly robust estimating function is available. It is based on a pseudo-risk criterion that has a robustness property if one of the estimators is biased.

For the task of model selection when estimating heterogeneous treatment effects, several methods have been developed (Kapelner et al., 2014; Rolling and Yang, 2014; Athey and Imbens, 2016; Nie and Wager, 2017; Zhao et al., 2017; Powers et al., 2018). Most of these methodologies are specific to the considered model class. A comparison of this line of work for individual treatment effects can be found in Schuler et al. (2018).

Van der Laan and Robins (2003) propose a loss-based approach for parameter-specific model selection. In this work, the authors recommend minimizing an empirical estimate of the overall risk R(θ̂_g, η̂), where θ̂_g, g = 1, . . . , G are candidate estimators and η̂ is an efficient estimator of the nuisance parameter, computed on the training data set. Our approach is more generic in the sense that we do not assume that the parameter of interest minimizes a known loss function.

Closest to our work is the cross-validation criterion developed by Brookhart and Van Der Laan (2006). It proceeds as follows.
The data D = (D_1, . . . , D_n) is randomly split into K disjoint, roughly equally-sized folds D_{2,1}, . . . , D_{2,K}. Define D_{1,k} = D \ D_{2,k}. If the data is i.i.d., D_{1,k} and D_{2,k} are independent for each k. Let θ̂_0 be an unbiased estimator of the parameter of interest θ_0, and let θ̂_g, g = 0, . . . , G, be candidate estimators. Then, we can compute the risk criterion

    R̃(g) = (1/K) Σ_{k=1}^K (θ̂_g(D_{1,k}) − θ̂_0(D_{2,k}))².    (1)

Using independence of D_{1,k} and D_{2,k},

    E[R̃(g)] = E[(θ̂_g(D_{1,1}) − θ_0)²] + Var(θ̂_0(D_{2,1})).

As Var(θ̂_0(D_{2,1})) is constant in g, the criterion in equation (1) can be used to select an estimator θ̂_g with low mean-squared error for estimating θ_0 among θ̂_g, g = 0, . . . , G. The criterion in equation (1) is attractive as it is simple and widely applicable. We will compare the proposed model selection criterion to the criterion in equation (1), both in theoretical results and in simulations.

In this paper, we work towards making parameter-specific model selection more reliable. We derive a model selection criterion that estimates the mean-squared error of a one-dimensional estimator. We show that the selected model has equal or lower variance than the baseline estimator asymptotically. Compared to the parameter-specific cross-validation approach developed by Brookhart and Van Der Laan (2006), we show that the proposed criterion has equal or lower variance asymptotically. The proposed criterion is flexible in the sense that it can be used for any one-dimensional estimator in both parametric and semi-parametric settings. Theoretical justification of the method is given under the assumption of asymptotic linearity.

In previous work, the goal of model selection for one-dimensional estimation is usually to select a nuisance parameter model from a set of candidate models, but the identification strategy is held fixed across models (Van der Laan and Robins, 2003; Brookhart and Van Der Laan, 2006; Cui and Tchetgen-Tchetgen, 2019).
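As an illustration, the K-fold criterion in equation (1) can be sketched in Python as follows. This is our own sketch, not the paper's reference implementation; the function and estimator names are illustrative.

```python
import numpy as np

def cv_criterion(data, theta_0, theta_g, K=10, seed=0):
    """K-fold criterion of equation (1): average squared difference between
    the candidate estimator on the training folds and the unbiased baseline
    estimator on the held-out folds."""
    rng = np.random.default_rng(seed)
    n = len(data)
    folds = np.array_split(rng.permutation(n), K)
    vals = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # candidate on training part, unbiased baseline on held-out part
        vals.append((theta_g(data[train_idx]) - theta_0(data[test_idx])) ** 2)
    return float(np.mean(vals))
```

For example, with the sample mean as the unbiased baseline, a heavily biased candidate yields a much larger criterion value than the mean itself, so the criterion ranks the candidates as intended.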
In this paper, the goal is to select among different estimands, or identification strategies. Mathematically, this corresponds to selecting among different functionals of the underlying data generating distribution. In some situations, this results in dramatic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed n, model selection can lead to increased risk in parts of the parameter space.

In simulations, we demonstrate that the proposed criterion allows us to trade off bias and variance in a variety of scenarios, including experiments, instrumental variables settings and data with selection on observables. The code can be found at github.com/rothenhaeusler/tms.

In Section 2, we introduce a method for parameter-specific model selection and discuss applications in causal inference. Theory for the method is discussed in Section 3. We evaluate the performance of the proposed procedure on simulated data in Section 4.
This section consists of two parts. We briefly discuss the setting in Section 2.1. Then, we introduce the method in Section 2.2 and discuss basic properties.
We observe data D = (D_i, i = 1, . . . , n), where the D_i are independently drawn from some unknown distribution P. Suppose we have access to estimators θ̂_g(D), g = 0, . . . , G, of some unknown parameter θ_0. In the following, to simplify notation, we will write θ̂_g instead of θ̂_g(D). We assume that the baseline estimator θ̂_0 is asymptotically unbiased for θ_0, i.e. that E[θ̂_0] = θ_0 + o(n^(−1/2)). In addition, for g ≠ 0, we assume that E[θ̂_g] = θ_g + o(n^(−1/2)) for some unknown θ_g.

2.2 The method

We aim to find the estimator g that minimizes

    R(g) = E[(θ̂_g − θ_0)²].    (2)

Here and in the following, we suppress the dependence of R(g) on n. As bias and variance of θ̂_g are unknown, the function R(g) is unknown and one has to minimize a proxy of the risk R(g) instead. We propose to estimate R(g) in equation (2) via

    R̂(g) = (θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0) + V̂ar(θ̂_g),    (3)

where V̂ar(θ̂_g − θ̂_0) is an estimator of the variance of θ̂_g − θ̂_0 and V̂ar(θ̂_g) is an estimator of the variance of θ̂_g. If explicit expressions for variance estimators of θ̂_g − θ̂_0 or θ̂_g are not available, we recommend estimating the variances via the bootstrap. We propose to choose a final estimate θ̂_ĝ by solving

    ĝ = arg min_g R̂(g).    (4)

In the following, we will sketch the theoretical justification for the approach in equation (4). The main reason to use the risk estimator in equation (3) over sample-splitting based methods is the potential reduction in asymptotic variance. We will discuss this further in Section 3. The method is evaluated on simulated data sets in Section 4. Under regularity conditions, we have

    E[V̂ar(θ̂_g − θ̂_0) + V̂ar(θ̂_g)] = Var(θ̂_g − θ̂_0) + Var(θ̂_g) + o(1/n).    (5)

If θ_g = θ_0, combining equation (3) with equation (5) yields

    E[R̂(g)] = Var(θ̂_g − θ̂_0) − Var(θ̂_g − θ̂_0) + Var(θ̂_g) + o(1/n)
             = E[(θ̂_g − θ_0)²] + o(1/n)
             = R(g) + o(1/n).
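The criterion in equation (3), with both variance terms estimated via the bootstrap as recommended above, might be sketched as follows. The names and defaults are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def bootstrap_var(data, estimator, B=200, seed=0):
    """Nonparametric bootstrap estimate of the variance of an estimator."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [estimator(data[rng.integers(0, n, n)]) for _ in range(B)]
    return float(np.var(reps, ddof=1))

def risk_criterion(data, theta_0, theta_g, B=200, seed=0):
    """Risk estimate of equation (3):
    (theta_g - theta_0)^2 - Var_hat(theta_g - theta_0) + Var_hat(theta_g)."""
    diff = lambda d: theta_g(d) - theta_0(d)
    return (diff(data) ** 2
            - bootstrap_var(data, diff, B, seed)
            + bootstrap_var(data, theta_g, B, seed))
```

For an estimator with bias b, the first two terms together estimate b², so a strongly biased candidate receives a risk estimate near b² while an unbiased candidate's risk estimate is dominated by its variance.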
Note that in this case, under regularity conditions, usually R(g) = O(1/n). Similarly, if θ_g ≠ θ_0, we obtain

    E[R̂(g)] = R(g) + o(1/√n).

Under regularity conditions, in this case usually R(g) = O(1). Thus, we have seen that equation (3) is approximately unbiased for the mean-squared error of θ̂_g if the approximation in equation (5) holds. As discussed in the sketch, (θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0) is an approximately unbiased estimator of the squared bias (θ_g − θ_0)². As we know that the squared bias is nonnegative, we can improve precision by defining the modified risk criterion

    R̂_mod(g) = ((θ̂_g − θ̂_0)² − V̂ar(θ̂_g − θ̂_0))_+ + V̂ar(θ̂_g).    (6)

Then the final estimator θ̂_ĝ is chosen such that ĝ minimizes equation (6). If there are ties, we select ĝ as the one that minimizes (θ̂_g − θ̂_0)² among the g that satisfy R̂_mod(g) = min_g R̂_mod(g). The criterion R̂_mod(g) is not asymptotically unbiased for R(g), but has some favorable statistical properties that we will discuss in the following section.

In this section we discuss the theoretical underpinnings and computational issues of the method introduced in Section 2. First, we show that the criterion R̂(g) is asymptotically unbiased for estimating the mean-squared error R(g) = E[(θ̂_g − θ_0)²]. Secondly, we compare the criterion R̂(g) to cross-validation in terms of asymptotic bias and variance. Then, we discuss the asymptotic risk of the resulting estimator. Finally, we derive a computational shortcut for forming approximate bootstrap confidence intervals.

We have to make some assumptions that guarantee that the estimators are asymptotically well-behaved. We make two major assumptions, in addition to the assumptions outlined in Section 2.1. The first major assumption is a slightly stronger version of asymptotic linearity.
Asymptotic linearity is an assumption that is commonly made to justify asymptotic normality of an estimator (Van der Vaart, 2000; Tsiatis, 2007). As our goal is to estimate the mean-squared error of an estimator, we use a slightly stronger version that guarantees convergence of second moments. The second major assumption is that the variance estimates are consistent.
Assumption 1.
We make two major assumptions.

1. Let θ̂_g, g = 0, . . . , G be estimators such that

       θ̂_g − θ_g = (1/n) Σ_{i=1}^n ψ_g(D_i) + e_g(n),

   where ψ_g(D_i) is centered and has finite nonzero second moments, and E[e_g(n)²] = o(1/n). To avoid trivial special cases, in addition we assume |Cor(ψ_g(D), ψ_0(D))| ≠ 1.

2. The estimators of variance are consistent, that means

       n·V̂ar(θ̂_g − θ̂_0) − lim_n n·Var(θ̂_g − θ̂_0) = o_P(1),
       n·V̂ar(θ̂_g) − lim_n n·Var(θ̂_g) = o_P(1).

Asymptotic linearity requires e_g(n) = o_P(1/√n), while we assume that E[e_g(n)²] = o(1/n). Thus, our assumption is slightly stronger than asymptotic linearity. Let us now turn to our main theoretical results.

Our first result shows that the proposed criterion is asymptotically unbiased for the mean-squared error of θ̂_g. The convergence rate depends on whether θ̂_g is asymptotically biased for estimating θ_0. If the estimator θ̂_g is unbiased, the convergence rate is faster than if the estimator is biased. The proof of the following result can be found in the Appendix.

Theorem 1 (Asymptotic unbiasedness of R̂(g)). Let Assumption 1 hold.

1. If θ_g = θ_0, then

       n(R̂(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

2. If θ_g ≠ θ_0, then

       √n(R̂(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

This result shows that the estimator is asymptotically unbiased, which is important for the theoretical justification of the approach. The major strength of the approach will become apparent in the next section, where we compare asymptotic bias and variance to the cross-validation criterion (1).

In this section, we will show that the proposed criterion has asymptotically lower variance than the cross-validation criterion (1) if θ_g = θ_0. Furthermore, we show that cross-validation is generally biased for estimating the mean-squared error if θ_g = θ_0. Let us first discuss asymptotic bias of the naive approach.
The short proof of this result can be found in the Appendix.

Theorem 2 (Asymptotic biasedness of R̃(g)). Let Assumption 1 hold and fix K.

1. If θ_g = θ_0 and Var(ψ_g(D)) ≠ Var(ψ_0(D)), then

       n(R̃(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with nonzero mean, where the mean depends on g.

2. If θ_g ≠ θ_0, then

       √n(R̃(g) − E[(θ̂_g − θ_0)²])

   converges to a random variable with zero mean.

Comparing this result with Theorem 1, we see a major difference in how the methods behave when θ_g = θ_0. The proposed criterion is asymptotically unbiased for the mean-squared error, while cross-validation is not. The cross-validation approach computes the estimator on randomly chosen subsets of the data. Thus, intuitively, this approach overestimates the asymptotic variance of the estimator. This defect of cross-validation could be solved by subtracting a correction term.

Now, let us turn to the asymptotic variance of the model selection criteria. The proof of the following result can be found in the Appendix.

Theorem 3 (Asymptotic variance of model selection criteria). Let Assumption 1 hold and fix K.

1. If θ_0 = θ_g, the asymptotic variance of n(R̂(g) − E[(θ̂_g − θ_0)²]) is strictly lower than the asymptotic variance of n(R̃(g) − E[(θ̂_g − θ_0)²]).

2. If θ_0 ≠ θ_g, the asymptotic variance of √n(R̂(g) − E[(θ̂_g − θ_0)²]) is equal to the asymptotic variance of √n(R̃(g) − E[(θ̂_g − θ_0)²]).

Roughly speaking, this theorem shows that the proposed criterion R̂(g) has equal or lower asymptotic variance than the cross-validation criterion R̃(g). The difference in variance leads to substantially different resulting risk in some scenarios, as we will see in the simulation section. The difference in variance is particularly large if the splits are imbalanced, i.e. if the test data sets have much smaller or much larger sample size than the training data sets.
Intuitively, if the validation data sets have small sample size, validation becomes unstable. However, if the validation data set is large, the estimator on the training data sets becomes unstable. This tradeoff can be avoided by estimating bias and variance separately, as done in the proposed approach.

Here and in the following, we focus on the case where the number of models G is small and fixed and n → ∞. We are focusing on the case where G is small, as we expect the method to be most reliable in scenarios where G is small. This will be further discussed in Section 3.5. First, we investigate the asymptotic behaviour of the proposed procedure in the case where the number of models is fixed and n → ∞. The proof of the following result can be found in the Appendix.

Corollary 1 (Asymptotic risk of selected model). Let Assumption 1 hold. Consider a finite and fixed number of estimators g = 0, . . . , G. Let

    ĝ = arg min_g R̂_mod(g).

For n → ∞,

    P[θ_ĝ = θ_0] → 1, and P[R(ĝ) ≤ R(0)] → 1.

In words, for n → ∞, the proposed method selects models with lower or equal risk than the baseline estimator θ̂_0. Interestingly, an analogous result does not hold for the cross-validation procedure (1). Even for n → ∞, the model selected by cross-validation may have larger risk than the baseline procedure θ̂_0 with positive probability. Note that Corollary 1 does not state that the procedure selects the optimal model asymptotically. This is indeed not true. It is straightforward to construct a model selection procedure that selects the optimal model asymptotically by placing additional weight on the variance term. For example, one might consider R̂_alt(g) = R̂_mod(g) + √n·V̂ar(θ̂_g). However, in our experience, placing additional weight on the variance term can lead to problematic performance compared to the baseline estimator θ̂_0 in finite samples and thus is not recommended.
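A minimal sketch of targeted selection with the modified criterion in equation (6), including the tie-breaking rule from Section 2.2, could look as follows. This is a simplified illustration under our own conventions (shared bootstrap indices across estimators, illustrative names), not the reference implementation.

```python
import numpy as np

def select_model(data, estimators, B=200, seed=0):
    """Targeted selection via equation (6):
    R_mod(g) = max((theta_g - theta_0)^2 - Var_hat(theta_g - theta_0), 0)
               + Var_hat(theta_g),
    where estimators[0] is the asymptotically unbiased baseline.
    Ties are broken by the smaller squared contrast with the baseline."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, (B, n))  # shared bootstrap indices
    reps = np.array([[est(data[i]) for est in estimators] for i in idx])
    point = np.array([est(data) for est in estimators])
    var_diff = np.var(reps - reps[:, [0]], axis=0, ddof=1)
    var_g = np.var(reps, axis=0, ddof=1)
    sq_contrast = (point - point[0]) ** 2
    r_mod = np.maximum(sq_contrast - var_diff, 0.0) + var_g
    # lexsort: primary key r_mod, secondary key squared contrast
    order = np.lexsort((sq_contrast, r_mod))
    g_hat = int(order[0])
    return g_hat, point[g_hat]
```

In a toy example with an unbiased baseline, a strongly biased candidate, and an unbiased but noisier candidate, the strongly biased candidate receives a large estimated squared bias and is essentially never selected.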
Deriving confidence intervals that are valid in conjunction with a model selection step is a challenging topic and has attracted substantial interest in recent years; see for example Berk et al. (2013); Taylor and Tibshirani (2015). Generally speaking, statistical inference after a model selection step can be unreliable if the uncertainty induced by the model selection step is ignored.

Thus, in settings where the number of estimators G is small and n is large, we recommend forming confidence intervals by bootstrapping the model selection procedure. More specifically, we recommend creating bootstrap samples D*_1, . . . , D*_B. Let ĝ(D*_b) = arg min_g R̂_mod(g), where R̂_mod(g) is computed on the bootstrap sample D*_b. Let θ̂_g(D*_b) be the estimator θ̂_g, computed on the data set D*_b. Then we can define the final estimate θ̂*_b = θ̂_{ĝ(D*_b)}(D*_b) for all b = 1, . . . , B. The lower and upper limit of a 95% confidence interval are then the empirical 2.5% quantile and the empirical 97.5% quantile of θ̂*_b, b = 1, . . . , B, respectively.

Bootstrapping the procedure in Section 2.2 can be computationally expensive, especially when Var(θ̂_0 − θ̂_g) and Var(θ̂_g) are estimated via the bootstrap. There is a computational shortcut such that bootstrap confidence intervals can be formed at virtually no additional computational cost. Note that the two variance terms Var(θ̂_0 − θ̂_g) and Var(θ̂_g) can often be estimated at rate o_P(1/n). In this case, asymptotically, the dominant source of uncertainty in R̂_mod(g) is estimation of (θ_g − θ_0)². Hence, we can keep the estimators of the variances V̂ar(θ̂_0 − θ̂_g) and V̂ar(θ̂_g) fixed across bootstrap samples. More specifically, on each bootstrap sample b = 1, . . . , B we propose to minimize the risk criterion

    R̂_mod,shortcut,b(g) = ((θ̂_g(D*_b) − θ̂_0(D*_b))² − V̂ar(θ̂_0 − θ̂_g))_+ + V̂ar(θ̂_g),

where

    V̂ar(θ̂_0 − θ̂_g) = 1/(B−1) Σ_{b=1}^B ( θ̂_0(D*_b) − θ̂_g(D*_b) − (1/B) Σ_{b'} (θ̂_0(D*_{b'}) − θ̂_g(D*_{b'})) )²,

    V̂ar(θ̂_g) = 1/(B−1) Σ_{b=1}^B ( θ̂_g(D*_b) − (1/B) Σ_{b'} θ̂_g(D*_{b'}) )².

Note that these terms only depend on B evaluations of θ̂_g(·) for g = 0, . . . , G, which drastically speeds up computation compared to naive computation of the intervals. We evaluate the actual coverage of intervals formed via this simplified bootstrap procedure in Section 4.

Note that the confidence intervals described above are only expected to be valid in the large-sample limit and for a small number of estimators G. For small sample size, the selection procedure might select a biased estimator with high probability. In this case, bootstrap confidence intervals are not reliable in general. If the procedure is tasked to select among a large number of estimators, a selective inference approach might be appropriate (Taylor and Tibshirani, 2015).
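The shortcut can be sketched as follows: the variance terms are computed once from the bootstrap replicates and then held fixed while only the selection step is re-run on each bootstrap sample. This is an illustrative sketch under our own naming, not the reference implementation.

```python
import numpy as np

def bootstrap_ci(data, estimators, B=500, seed=0, alpha=0.05):
    """Percentile bootstrap interval for the selected estimator, using the
    shortcut: variance terms are estimated once from the bootstrap replicates
    and kept fixed across bootstrap samples; estimators[0] is the baseline."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, (B, n))
    reps = np.array([[est(data[i]) for est in estimators] for i in idx])
    var_diff = np.var(reps - reps[:, [0]], axis=0, ddof=1)  # Var(theta_0 - theta_g)
    var_g = np.var(reps, axis=0, ddof=1)                    # Var(theta_g)
    # re-run only the selection step on each bootstrap sample
    r_mod = np.maximum((reps - reps[:, [0]]) ** 2 - var_diff, 0.0) + var_g
    theta_star = reps[np.arange(B), np.argmin(r_mod, axis=1)]
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The only per-bootstrap work is evaluating each θ̂_g once, which is exactly the B evaluations mentioned above.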
In this section, we discuss applications of the proposed methodology. Compared to the existing literature on model selection for causal effects, instead of selecting among nuisance parameter models, we consider shrinking between different functionals of the data generating distribution. As we will see, doing so can lead to drastic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed n, model selection can lead to increased risk in parts of the parameter space.

In the following we will use potential outcomes to define causal effects. We are interested in the causal effect of a treatment T ∈ {0, 1} on an outcome Y. We use the potential outcome framework (Rubin, 1974; Splawa-Neyman et al., 1990). Let Y(1) denote the potential outcome under treatment T = 1 and Y(0) the potential outcome under treatment T = 0. We assume a superpopulation model, i.e. Y(1) and Y(0) are random variables. In the following, the goal is to estimate the average treatment effect

    θ_0 = E[Y(1) − Y(0)].    (7)

Many methods have been designed to estimate (7) and these methods operate under a variety of assumptions. We present several applications that are based on different sets of assumptions for identifying (7). In each of the cases, we compare the proposed method (6), termed "targeted selection", with the cross-validation procedure (1) and with a baseline estimator. The code can be found at github.com/rothenhaeusler/tms.

In observational studies, it is common practice to estimate causal effects under the assumption of unconfoundedness and under the overlap assumption. Roughly speaking, the overlap assumption states that treatment assignment probabilities are bounded away from zero and one, conditional on covariates X.
If these assumptions are met, it is possible to identify the average treatment effect via matching, inverse probability weighting, regression adjustment, or doubly robust methods (Hernan and Robins, 2020; Imbens and Wooldridge, 2009). However, if the overlap is limited, estimating the average treatment effect can be unreliable.

To deal with the issue of limited overlap, researchers sometimes switch to different estimands such as the average effect on the treated (ATT) or the overlap-weighted effect (Crump et al., 2006). In the following, we will focus on the overlap-weighted effect as it is the causal contrast that can be estimated with the lowest asymptotic variance in certain scenarios (Crump et al., 2006). The overlap-weighted effect is defined as

    θ_1 = E[p(T = 1|X)(1 − p(T = 1|X)) τ(X)] / E[p(T = 1|X)(1 − p(T = 1|X))],

where τ(x) = E[Y(1) − Y(0)|X = x]. Note that if the treatment effect is homogeneous, τ(x) ≡ const., then the overlap-weighted effect and the average treatment effect coincide, that means θ_1 = θ_0. Thus, shrinkage towards an efficient estimator of the overlap effect is potentially beneficial under treatment effect homogeneity.

We investigate shrinking between estimators of the average treatment effect and the overlap-weighted effect in a data-driven way. The proposed model selection tool will be used to trade off bias and variance.

We observe 1000 independent and identically distributed draws (Y_i(T_i), T_i, X_i) of a distribution P, where the X_i are covariates. The data generating process was chosen such that there is limited overlap, i.e. P[T = 1|X = 0] ≈ 0.05, and treatment assignment is unconfounded, (Y(0), Y(1)) ⊥ T | X (Rosenbaum and Rubin, 1983). As discussed above, the causal effect can be estimated via doubly robust methods such as augmented inverse probability weighting, among others (Hernan and Robins, 2020). The data is generated according to the following equations:

    ε_Y ∼ N(0, 1),  X ∼ Ber(0.5),
    T ∼ Ber(0.7) if X = 1,  T ∼ Ber(0.05) if X = 0,
    Y(t) = X + t + 3ts·X + ε_Y,    (8)

where s ∈ [0, 1]. For s = 0, the treatment effect is homogeneous, as τ(x) ≡ 1. Thus, for s = 0, the overlap-weighted effect coincides with the average treatment effect.

We can estimate the average treatment effect via augmented inverse probability weighting (Robins et al., 1994), θ̂_0 = μ̂_1 − μ̂_0, where

    μ̂_a = (1/n) Σ_{i=1}^n [ Y_i·1{T_i = a} / p̂(T_i|X_i) − ((1{T_i = a} − p̂(T_i|X_i)) / p̂(T_i|X_i))·Q̂(X_i, a) ],

and where Q̂(x, t) is the empirical mean of Y given X = x and T = t, and p̂(·|·) are empirical probabilities. Similarly as above, we can estimate the overlap effect by augmented inverse probability weighting,

    θ̂_1 = (η̂_1 − η̂_0) / ( (1/n) Σ_i p̂(T_i = 1|X_i)(1 − p̂(T_i = 1|X_i)) ),

where

    η̂_a = (1/n) Σ_{i=1}^n [ Y_i·1{T_i = a}·(1 − p̂(T_i|X_i)) − (1{T_i = a} − p̂(T_i|X_i))·(1 − p̂(T_i|X_i))·Q̂(X_i, a) ].

For w ∈ {0, 1/10, . . . , 1} we define

    θ̂_w = (1 − w)·θ̂_1 + w·θ̂_0.    (9)

For s ≈ 0, due to treatment effect homogeneity we expect E[(θ̂_1 − θ_0)²] < E[(θ̂_0 − θ_0)²]. For s ≈ 1, we expect E[(θ̂_1 − θ_0)²] > E[(θ̂_0 − θ_0)²]. In the first case, the optimal estimator is θ̂_w with w ≈ 0. In the second case, the optimal estimator is θ̂_w with w ≈ 1.

Figure 1: Mean-squared error R(ŵ) where ŵ is selected via the cross-validation criterion or targeted selection. The data is drawn according to equation (8). Cross-validation and targeted selection are used to shrink between the AIPW ATE estimator and the AIPW overlap estimator. The proposed method performs equal or better than cross-validation across most s ∈ [0, 1].

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 1. Results are averaged across 200 simulation runs. For s < 0.3, targeted selection outperforms the baseline estimator θ̂_0. For 0.3 < s < 0.8, targeted selection performs worse than θ̂_0. Over almost the entire range s ∈ [0, 1], targeted selection performs equal to or better than cross-validation.

The instrumental variables approach is a widely-used method to estimate the causal effect of a treatment T on a target outcome Y in the presence of confounding (Wright, 1928; Bowden and Turkington, 1990; Angrist et al., 1996). Roughly speaking, the method relies on a predictor I (called the instrument) of the treatment T that is not associated with the error term of the outcome Y. We will not discuss the assumptions behind instrumental variables in detail, but refer the interested reader to Hernan and Robins (2020). We will focus on the case where I, T and Y are one-dimensional. Under IV assumptions and linearity, the target quantity can be re-written as

    θ_0 = E[Y(1) − Y(0)] = Cov(I, Y) / Cov(I, T).

Estimating this quantity can be challenging if the instrument is weak, i.e. if Cov(I, T) ≈ 0. In this case, the approach can benefit from shrinkage towards the ordinary least-squares solution (Nagar, 1959; Theil, 1961; Rothenhäusler et al., 2020; Jakobsen and Peters, 2020). Doing so may decrease the variance but generally introduces bias. We will focus on the case where we have some additional observational data, where we observe T and Y, but where the instrument I is unobserved.

We draw 500 i.i.d. observations according to the following equations:
    I, H, ε_T, ε_Y ∼ N(0, 1),
    T = I + H + ε_T,
    Y(t) = t − s·H + ε_Y,    (10)

We vary s ∈ [0, 2], which controls the strength of the confounding between T and Y. We observe (T_i, Y_i(T_i), I_i) for i = 1, . . . , 500. In addition, we have observational samples i = 501, . . . , where we observe T and Y, but not the instrument I. Formally, for i = 501, . . . , we observe (T_i, Y_i(T_i)) drawn according to equation (10).

In the linear case, the instrumental variables approach can be written as

    b̂_IV = Ĉov(I, Y) / Ĉov(I, T),

where Ĉov denotes the empirical covariance over the observations i = 1, . . . , 500. The ordinary least-squares estimator is

    b̂_OLS = argmin_b min_c Ê[(Y − Tb − c)²],

where Ê denotes the empirical expectation over the observations i = 1, . . . .

Figure 2: Mean-squared error R(ŵ) where ŵ is selected via the cross-validation criterion or targeted selection. The data is drawn according to equation (10). Cross-validation and targeted selection are used to stabilize the instrumental variables approach by shrinking the estimate towards ordinary least-squares. The proposed method performs equal or better than cross-validation across all s ∈ [0, 2].

Shrinking towards OLS introduces bias for s ≠ 0, but potentially decreases variance. As candidate estimators, for any w ∈ {0/10, 1/10, . . . , 10/10} we consider convex combinations of OLS and IV,

    θ̂_w = w·b̂_OLS + (1 − w)·b̂_IV.

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 2. Results are averaged across 200 simulation runs. Over the entire range s ∈ [0, 2], targeted selection performs equal to or better than cross-validation. For s ≈ 0, targeted selection outperforms the IV approach. For s > 0.5, the proposed approach performs worse than the IV approach. For s ≈ 2, the proposed approach performs similar to the IV approach. We evaluated the realized coverage of confidence intervals with nominal coverage 95% as described in Section 3.5 across s ∈ [0, 2].

4.3 Experiment with proxy outcome

One of the most popular estimators for causal effects in experimental settings is difference-in-means. To improve variance, it is possible to adjust for pre-treatment covariates, see for example Lin (2013); Deng et al. (2013). This raises the question whether post-treatment covariates can be used to improve the precision of causal effect estimates. This is indeed the case under additional assumptions. For example, in some cases, the treatment effect can be written as the product

    θ_0 = E[Y|T = 1] − E[Y|T = 0] = θ_{T→P} · θ_{P→Y},    (11)

where θ_{T→P} = E[P|T = 1] − E[P|T = 0] is the effect of the treatment on some surrogate or proxy outcome P ∈ {0, 1}; and θ_{P→Y} = E[Y|P = 1] − E[Y|P = 0] is the effect of the proxy on the outcome. It is well-known that estimators that make use of such decompositions can outperform the standard difference-in-means estimator in terms of asymptotic variance (Tsiatis, 2007; Athey et al., 2019; Guo and Perković, 2020). However, doing so can introduce bias if equation (11) does not hold. We will use the proposed model selection procedure to shrink between difference-in-means and an estimator that is unbiased if the treatment effect decomposition in equation (11) holds.

We consider a simple experimental setting with a post-treatment variable P. For simplicity, let us consider an experiment with binary treatment T ∈ {0, 1}, a binary proxy outcome P ∈ {0, 1} and outcome Y. We draw 200 i.i.d. observations according to the following equations:

    T ∼ Ber(0.5),
    ε_P, ε_Y ∼ N(0, 1),
    P(t) = 1{ε_P ≤ t},
    Y(t) = (1/2)·P(t) + s·t + ε_Y.    (12)

For s = 0, the outcome Y(T) is conditionally independent of the treatment, given the proxy P(T).
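For concreteness, the data generating process in equation (12) might be simulated as below. The treatment probability Ber(0.5), the unit-variance noise, and the 1/2 coefficient on the proxy are our reading of equation (12) and should be treated as illustrative assumptions.

```python
import numpy as np

def draw_proxy_data(n, s, seed=0):
    """Simulate the proxy-outcome experiment of equation (12).
    Treatment probability and noise scales are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    T = rng.binomial(1, 0.5, n)           # assumed Ber(0.5) assignment
    eps_P = rng.normal(size=n)
    eps_Y = rng.normal(size=n)
    P = (eps_P <= T).astype(float)        # P(t) = 1{eps_P <= t}
    Y = 0.5 * P + s * T + eps_Y           # Y(t) = (1/2) P(t) + s t + eps_Y
    return T, P, Y
```

For s = 0 the only path from T to Y runs through P, which is the surrogate condition Y ⊥ T | P underlying the product estimator below.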
In this case, the average treatment effect can be written in product form, $\theta = \theta_{T \to P} \cdot \theta_{P \to Y}$, and this decomposition can be leveraged for estimation. For $s \neq 0$, this decomposition does not hold.

The standard estimator of causal effects from experiments is difference-in-means,
\[
\hat\theta_1 = \frac{1}{\sum_i T_i} \sum_{i : T_i = 1} Y_i - \frac{1}{\sum_i (1 - T_i)} \sum_{i : T_i = 0} Y_i.
\]
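The simulation design in equation (12) and the difference-in-means estimator can be sketched in a few lines of code (a minimal sketch; the helper names `simulate` and `difference_in_means` are not from the paper, and the seed is an arbitrary choice):

```python
import numpy as np

def simulate(s, n=200, rng=None):
    """Draw n i.i.d. observations from the design in equation (12)."""
    rng = np.random.default_rng(rng)
    T = rng.binomial(1, 0.5, size=n)      # T ~ Ber(0.5)
    eps_P = rng.normal(size=n)            # noise driving the proxy
    eps_Y = rng.normal(size=n)            # noise driving the outcome
    P = (eps_P <= T).astype(float)        # P(t) = 1{eps_P <= t}
    Y = 0.5 * P + s * T + eps_Y           # Y(t) = P(t)/2 + s*t + eps_Y
    return T, P, Y

def difference_in_means(T, Y):
    """Standard difference-in-means estimator of the ATE."""
    return Y[T == 1].mean() - Y[T == 0].mean()

T, P, Y = simulate(s=0.0, rng=0)
print(difference_in_means(T, Y))
```

For $s = 0$ the true effect is $\theta = \tfrac{1}{2}(\Phi(1) - \Phi(0))$, which the estimator recovers up to sampling noise.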
If the proxy outcome is a valid surrogate, i.e. if $Y \perp T \mid P$, we can rewrite $\theta$ as
\begin{align*}
\theta &= E[Y \mid T=1] - E[Y \mid T=0] \\
&= E\big[ E[Y \mid P=1] P + E[Y \mid P=0](1-P) \mid T=1 \big] - E\big[ E[Y \mid P=1] P + E[Y \mid P=0](1-P) \mid T=0 \big] \\
&= \big( E[Y \mid P=1] - E[Y \mid P=0] \big) \big( E[P \mid T=1] - E[P \mid T=0] \big) \\
&= \theta_{T \to P} \cdot \theta_{P \to Y}.
\end{align*}
Thus, in this case, we can also consider the product estimator
\[
\hat\theta_2 = \Big( \frac{1}{\sum_i T_i} \sum_{i : T_i = 1} P_i - \frac{1}{\sum_i (1 - T_i)} \sum_{i : T_i = 0} P_i \Big) \cdot \Big( \frac{1}{\sum_i P_i} \sum_{i : P_i = 1} Y_i - \frac{1}{\sum_i (1 - P_i)} \sum_{i : P_i = 0} Y_i \Big).
\]
We shrink between these two estimators, i.e. for $w$ in an equidistant grid $\{0, \ldots, 1\}$ we define $\hat\theta_w = (1 - w) \hat\theta_1 + w \hat\theta_2$.

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 3. Results are averaged across 200 simulation runs. Similarly as above, targeted selection performs similarly to or better than cross-validation. Targeted selection performs better than the baseline model for $s < 0.4$. The proposed procedure performs worse than the baseline procedure in the regime $s \in [0.4, 1]$. For $s \ge 1$, targeted selection approaches the performance of difference-in-means. We evaluated the realized coverage of confidence intervals with nominal coverage 95% as described in Section 3.5, across $s \in [0, 1.2]$.

Figure 3: Mean-squared error $R(\hat w)$, where $\hat w$ is selected via the cross-validation criterion or via targeted selection. The data is drawn according to equation (12). Cross-validation and targeted selection are used to stabilize the difference-in-means estimator by shrinking towards an estimator that makes use of a proxy outcome. The proposed method performs similarly to or better than cross-validation across $s \in [0, 1.2]$. (Curves: targeted selection, cross-validation, difference-in-means, proxy model.)

5 Discussion

We have introduced a method that allows one to conduct targeted parameter selection by estimating the bias and variance of candidate estimators. The theoretical justification of the method relies on a linear asymptotic expansion of the estimator. The method is very general and can be used in both parametric and semi-parametric settings. Under regularity conditions, we showed that, for $n \to \infty$, the modified risk criterion selects models with lower or equal risk than the baseline estimator $\hat\theta$. Furthermore, we discussed a computational shortcut to obtain bootstrap confidence intervals.

In simulations, we showed that the method selects reasonable models and outperforms the cross-validation procedure in most scenarios. The proposed method can decrease variance if the competing estimators are approximately unbiased. However, there is no free lunch. In transitional regimes, for fixed $n$, the estimator can perform worse than the baseline estimator $\hat\theta$.
This is to be expected from statistical theory; see for example the discussion of the Hodges-Le Cam estimator in Van der Vaart (2000).

The theoretical justification of the proposed method relies on a linear approximation of the estimator in a neighborhood of the parameter values $\theta_g$. Thus, it would be important to understand the performance of the method in scenarios where parameter estimates of some of the estimators are far from the parameter values. To quantify uncertainty, we develop a modified bootstrap procedure. If the procedure is tasked with selecting among a large number of estimators, a selective inference approach might be more appropriate (Taylor and Tibshirani, 2015). In Section 4.2, we have seen some preliminary evidence that the proposed methodology may be used to combine knowledge across data sets. The proposed method is not tailored to this special case. Thus, we believe that it would be exciting to investigate whether the model selection can be further improved for data fusion tasks.

Acknowledgements
The author would like to thank Guillaume Basse, Peter Bühlmann, Guido Imbens, Nicolai Meinshausen, Jonas Peters, and Bin Yu for inspiring discussions.
Appendix
This appendix contains proofs for the theoretical results in the main paper.
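The proofs below manipulate a risk criterion of the form $\hat R(g) = (\hat\theta_g - \hat\theta)^2 - \widehat{\mathrm{Var}}(\hat\theta_g - \hat\theta) + \widehat{\mathrm{Var}}(\hat\theta_g)$. As a numerical sanity check of the expansion used in the first proof, one can simulate a toy model in which the candidate and baseline estimators are exact averages of observed influence functions (a sketch only; the constants and variable names are arbitrary, and the form of $\hat R$ above is an assumption read off from the expansion below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 5000
rho, sd_g = 0.8, 1.5          # correlation and scale of psi_g relative to psi

# Toy model with theta = 0: hat_theta_g = mean(psi_g), hat_theta = mean(psi).
r_hat = np.empty(B)
for b in range(B):
    psi = rng.normal(size=n)
    psi_g = sd_g * (rho * psi + np.sqrt(1 - rho**2) * rng.normal(size=n))
    diff = psi_g - psi
    # hat_R(g) = (hat_theta_g - hat_theta)^2 - Var_hat(diff)/n + Var_hat(psi_g)/n
    r_hat[b] = diff.mean() ** 2 - diff.var(ddof=1) / n + psi_g.var(ddof=1) / n

mse_true = sd_g**2 / n        # true MSE of hat_theta_g, since theta = 0
print(r_hat.mean(), mse_true)
```

Averaged over simulation runs, $\hat R(g)$ matches the true mean-squared error $\mathrm{Var}(\psi_g)/n$ up to Monte Carlo error, in line with Case 1 of the first proof.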
Proof.
Case 1: $\theta_g = \theta$.

First, note that as $E[e_g(n)] = o(1/n)$ and as $E[\hat\theta_g] - \theta_g = o(1/\sqrt{n})$,
\[
E[(\hat\theta_g - \theta)^2] = \frac{1}{n} \mathrm{Var}(\psi_g) + o\Big(\frac{1}{n}\Big).
\]
By assumption,
\[
n \widehat{\mathrm{Var}}(\hat\theta_g - \hat\theta) - \mathrm{Var}(\psi_g - \psi) = o_P(1).
\]
Similarly,
\[
n \widehat{\mathrm{Var}}(\hat\theta_g) - \mathrm{Var}(\psi_g) = o_P(1).
\]
Combining these equations with the definition of $\hat R(g)$,
\[
\hat R(g) = \Big( \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \frac{1}{n} \mathrm{Var}(\psi_g - \psi) + \frac{1}{n} \mathrm{Var}(\psi_g) + o_P(1/n).
\]
Using the central limit theorem, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a Gaussian random variable with mean zero and variance $\mathrm{Var}(\psi_g - \psi)$. Thus,
\[
n \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = n \Big( \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1)
= \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1)
\]
converges to a random variable with mean zero. This concludes the proof of the case $\theta_g = \theta$.

Case 2: $\theta_g \neq \theta$.

Similarly as above, we can show that
\[
\hat R(g) = (\theta_g - \theta)^2 + 2 (\theta_g - \theta) \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1/\sqrt{n}),
\]
and
\[
E[(\hat\theta_g - \theta)^2] = (\theta_g - \theta)^2 + o(1/\sqrt{n}).
\]
Thus,
\[
\sqrt{n} \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = 2 (\theta_g - \theta) \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1). \quad (13)
\]
Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a centered Gaussian random variable with variance $\mathrm{Var}(\psi_g(D) - \psi(D))$. Using this fact in equation (13) concludes the proof.

Proof.
Case 1: $\theta_g = \theta$.

Let $S_k = \{ i : D_i \in D_{1,k} \}$. Analogously as in the proof of Theorem 1, it follows that
\[
n \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = \frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g) + o_P(1).
\]
Recall that $|S_k| \sim n (K-1)/K$. Using the CLT, for every $k = 1, \ldots, K$,
\[
\sqrt{n} \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)
\]
converges to a centered Gaussian random variable with variance
\[
\frac{1}{1 - \alpha} \mathrm{Var}(\psi_g(D)) + \frac{1}{\alpha} \mathrm{Var}(\psi(D)),
\]
where $\alpha = 1/K$. Hence,
\[
\frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g)
\]
converges to a random variable with asymptotic mean
\[
\frac{\alpha}{1 - \alpha} \mathrm{Var}(\psi_g(D)) + \frac{1}{\alpha} \mathrm{Var}(\psi(D)).
\]
As $\mathrm{Var}(\psi_g(D)) \neq \mathrm{Var}(\psi(D))$, the asymptotic mean of $n ( \tilde R(g) - E[(\hat\theta_g - \theta)^2])$ is a nonzero quantity that depends on $g$. This concludes the proof of case 1.

Case 2: $\theta_g \neq \theta$.

Similarly as above, we can show that
\[
\tilde R(g) = (\theta_g - \theta)^2 + 2 (\theta_g - \theta) \frac{1}{n} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1/\sqrt{n}).
\]
Thus,
\[
\sqrt{n} \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = 2 (\theta_g - \theta) \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) + o_P(1). \quad (14)
\]
Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n ( \psi_g(D_i) - \psi(D_i) )$ converges to a centered Gaussian random variable with variance $\mathrm{Var}(\psi_g(D) - \psi(D))$. Using this fact in equation (14) concludes the proof.

Proof.
Case 1: $\theta_g = \theta$.

Let $S_k = \{ i : D_i \in D_{1,k} \}$. Inspecting the proof of Theorem 1, we obtain that
\[
n \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Multiplying with $K = 1/\alpha$, we can rewrite this as
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \frac{\sqrt{K}}{\sqrt{n}} \sum_{i=1}^n \big( \psi_g(D_i) - \psi(D_i) \big) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Writing $Z_k = \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi_g(D_i)$ and $X_k = \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi(D_i)$ and using $\alpha = 1/K$, we obtain
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \sum_{k=1}^K (Z_k - X_k) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Similarly, by inspecting the proof of Theorem 2,
\[
n \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = \frac{n}{K} \sum_k \Big( \frac{1}{|S_k|} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{n - |S_k|} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - \mathrm{Var}(\psi_g) + o_P(1).
\]
We have $n - |S_k| \sim \alpha n$ and $|S_k| \sim (K-1) \alpha n$. Thus, multiplying with $K = 1/\alpha$, we can rewrite this as
\[
n K \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = K \sum_k \Big( \frac{1}{K-1} \frac{1}{\sqrt{\alpha n}} \sum_{i \in S_k} \psi_g(D_i) - \frac{1}{\sqrt{\alpha n}} \sum_{i \notin S_k} \psi(D_i) \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1)
= K \sum_k \Big( \frac{1}{K-1} \sum_{j \neq k} Z_j - X_k \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1).
\]
Thus, to complete the proof, we have to show that the asymptotic variance of the first term is larger than the asymptotic variance of the second term:
\[
n K \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big) = K \sum_k \Big( \frac{1}{K-1} \sum_{j \neq k} Z_j - X_k \Big)^2 - K \mathrm{Var}(\psi_g) + o_P(1),
\]
\[
n K \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big) = \Big( \sum_{k=1}^K (Z_k - X_k) \Big)^2 - K \mathrm{Var}(\psi_g - \psi) + o_P(1).
\]
Using the CLT, for $n \to \infty$ and $K$ fixed, $(X_1, \ldots, X_K, Z_1, \ldots, Z_K)$ converges to a centered multivariate Gaussian vector with non-degenerate covariance. Hence, without loss of generality, in the following we can assume that the vector $(X_1, \ldots, X_K, Z_1, \ldots, Z_K)$ is multivariate Gaussian. Recall that we assume $|\mathrm{Cor}(\psi_g, \psi)| \neq 1$. Thus, we also have $|\mathrm{Cor}(Z_j, X_j)| \neq 1$. After dividing both leading terms by $K^2$, the comparison is exactly the inequality of Lemma 1, which completes the proof of case 1.

Case 2: $\theta_g \neq \theta$.

Inspecting the proofs of Theorem 1 and Theorem 2, we obtain that in the case $\theta_g \neq \theta$, the asymptotic variance of $\sqrt{n} \big( \hat R(g) - E[(\hat\theta_g - \theta)^2] \big)$ is
\[
4 (\theta_g - \theta)^2 \mathrm{Var}(\psi_g(D) - \psi(D)), \quad (15)
\]
which is the same as the asymptotic variance of $\sqrt{n} \big( \tilde R(g) - E[(\hat\theta_g - \theta)^2] \big)$. This concludes the proof.

Lemma 1.
Let $(Z_i, X_i)$, $i = 1, \ldots, K$, be i.i.d. Gaussian with mean zero and nonzero variance and $|\mathrm{Cor}(Z_i, X_i)| \neq 1$. Let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - X_i) \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - X_i \Big)^2 \Big). \quad (16)
\]

Proof.
As $(Z_i, X_i)$ are multivariate Gaussian and $|\mathrm{Cor}(Z_i, X_i)| \neq 1$, we can write $X_i = \alpha Z_i + \epsilon_i$ for some $\alpha \in \mathbb{R}$, where $(\epsilon_i)_i$ is centered Gaussian with nonzero variance and independent of $(Z_i)_i$. Thus, it suffices to show that
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i - \epsilon_i) \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i - \epsilon_i \Big)^2 \Big). \quad (17)
\]
Expanding the square, the left-hand side of equation (17) equals
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 - 2 \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) + \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big)
\]
\[
= \mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) + 4 \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) + \mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big).
\]
Similarly, the right-hand side of equation (17) equals
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big) + 4 \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big) + \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big).
\]
Now,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big),
\]
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big),
\]
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big).
\]
The first two inequalities follow from Lemma 3 and Lemma 2, respectively.
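The two expansions above use that the remaining cross-covariances vanish (each such term contains an odd number of independent centered Gaussian factors) and that the variance of a product of independent centered random variables factorizes:
\[
\mathrm{Var}(UV) = E[U^2 V^2] - E[UV]^2 = E[U^2] E[V^2] = \mathrm{Var}(U) \mathrm{Var}(V)
\]
for independent $U, V$ with $E[U] = E[V] = 0$.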
Let us now deal with the last term. We have
\[
\mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big) \Big) - \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big) \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)
\]
\[
= \sum_{i \neq j} \mathrm{Var}\Big( \frac{1}{K(K-1)} \epsilon_i Z_j \Big) + \sum_i \mathrm{Var}\Big( \frac{\alpha}{K} \epsilon_i Z_i \Big) - \frac{(1-\alpha)^2}{K^2} \mathrm{Var}(Z_1) \mathrm{Var}(\epsilon_1)
\]
\[
= \mathrm{Var}(\epsilon_1) \mathrm{Var}(Z_1) \Big( \frac{1}{K(K-1)} + \frac{\alpha^2}{K} - \frac{(1-\alpha)^2}{K^2} \Big).
\]
Thus, it suffices to show that
\[
\frac{1}{K(K-1)} - \frac{1}{K^2} + \frac{\alpha^2}{K} - \frac{\alpha^2}{K^2} + \frac{2\alpha}{K^2} \ge 0,
\]
that is,
\[
\frac{1}{K^2(K-1)} + \frac{\alpha^2 (K-1)}{K^2} + \frac{2\alpha}{K^2} \ge 0. \quad (18)
\]
Dividing by $(K-1)/K^2$ yields
\[
\frac{1}{(K-1)^2} + \alpha^2 + \frac{2\alpha}{K-1} \ge 0.
\]
The left-hand side can be written as
\[
\Big( \frac{1}{K-1} + \alpha \Big)^2.
\]
This shows the inequality in equation (18), which completes the proof.

Lemma 2.
Let $\epsilon_i$, $i = 1, \ldots, K$, be i.i.d. centered Gaussian random variables with nonzero variance, and let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i \Big)^2 \Big) < \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \epsilon_i^2 \Big). \quad (19)
\]

Proof.
On the left-hand side of equation (19) we have $\frac{2}{K^2} \mathrm{Var}(\epsilon_1)^2$. On the right-hand side of equation (19) we have $\frac{2}{K} \mathrm{Var}(\epsilon_1)^2$. As $K \ge 2$ and $\mathrm{Var}(\epsilon_1) > 0$, this completes the proof.
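Both computations rest on the Gaussian fourth-moment identity: if $\xi \sim N(0, v)$, then
\[
\mathrm{Var}(\xi^2) = E[\xi^4] - E[\xi^2]^2 = 3 v^2 - v^2 = 2 v^2.
\]
Applied with $\xi = \epsilon_1$ and with $\xi = \frac{1}{K} \sum_i \epsilon_i \sim N(0, \mathrm{Var}(\epsilon_1)/K)$, this yields the two sides of equation (19).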
Lemma 3.
Let $Z_i$, $i = 1, \ldots, K$, be i.i.d. centered Gaussian random variables, let $\alpha \in \mathbb{R}$, and let $K \ge 2$. Then,
\[
\mathrm{Var}\Big( \Big( \frac{1}{K} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \frac{1}{K} \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big).
\]

Proof.
It suffices to show that
\[
\mathrm{Var}\Big( \Big( \frac{1}{\sqrt{K}} \sum_{i=1}^K (Z_i - \alpha Z_i) \Big)^2 \Big) \le \mathrm{Var}\Big( \sum_{i=1}^K \Big( \frac{1}{K-1} \sum_{j \neq i} Z_j - \alpha Z_i \Big)^2 \Big), \quad (20)
\]
which is the claimed inequality with both sides multiplied by $K^2$. Since $\frac{1}{\sqrt{K}} \sum_{i=1}^K (1-\alpha) Z_i \sim N(0, (1-\alpha)^2 \mathrm{Var}(Z_1))$, on the left-hand side of equation (20) we have
\[
2 (1 - \alpha)^4 \mathrm{Var}(Z_1)^2.
\]
Expanding the squares and collecting terms, on the right-hand side of equation (20) we have
\[
\mathrm{Var}\Big( \Big( \frac{1}{K-1} + \alpha^2 \Big) \sum_{i=1}^K Z_i^2 + \Big( \frac{2(K-2)}{(K-1)^2} - \frac{4\alpha}{K-1} \Big) \sum_{i > j} Z_i Z_j \Big)
\]
\[
= 2 K \Big( \frac{1}{K-1} + \alpha^2 \Big)^2 \mathrm{Var}(Z_1)^2 + \frac{K(K-1)}{2} \Big( \frac{2(K-2)}{(K-1)^2} - \frac{4\alpha}{K-1} \Big)^2 \mathrm{Var}(Z_1)^2,
\]
using that the $Z_i^2$ and the products $Z_i Z_j$, $i > j$, are mutually uncorrelated with $\mathrm{Var}(Z_i^2) = 2 \mathrm{Var}(Z_1)^2$ and $\mathrm{Var}(Z_i Z_j) = \mathrm{Var}(Z_1)^2$. Combining the two computations above and dividing by $2 \mathrm{Var}(Z_1)^2$, it suffices to show that for all $\alpha$ and all $K \ge 2$,
\[
K \alpha^4 + \frac{6 K \alpha^2}{K-1} - \frac{4 \alpha K (K-2)}{(K-1)^2} + \frac{K}{(K-1)^2} + \frac{K(K-2)^2}{(K-1)^3} \ge \alpha^4 - 4\alpha^3 + 6\alpha^2 - 4\alpha + 1.
\]
Rearranging, and using the identities $(K-1)^2 - K(K-2) = 1$ and $K(K-1) + K(K-2)^2 - (K-1)^3 = 1$, it suffices to show that
\[
(K-1)\alpha^4 + 4\alpha^3 + \frac{6\alpha^2}{K-1} + \frac{4\alpha}{(K-1)^2} + \frac{1}{(K-1)^3} \ge 0.
\]
Multiplying with $(K-1)^2$, this is equivalent to
\[
(K-1)^3 \alpha^4 + 4 \alpha^3 (K-1)^2 + 6 \alpha^2 (K-1) + 4 \alpha + \frac{1}{K-1} \ge 0. \quad (21)
\]
Rearranging the left-hand side, we obtain
\[
(K-1)^3 \Big( \alpha + \frac{1}{K-1} \Big)^4 \ge 0.
\]
This proves the inequality in equation (21), which completes the proof.
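The inequality of Lemma 1 can also be cross-checked by simulation (a sketch; the choices of $K$, the correlation, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, rho = 200_000, 5, 0.5          # replications, folds, Cor(Z_i, X_i)

# B independent copies of (Z_1,...,Z_K, X_1,...,X_K) with X_i = rho*Z_i + noise.
Z = rng.normal(size=(B, K))
X = rho * Z + np.sqrt(1 - rho**2) * rng.normal(size=(B, K))

# Left-hand side of (16): variance of the squared average of Z_i - X_i.
lhs = np.var((Z - X).mean(axis=1) ** 2)

# Right-hand side of (16): variance of the average of squared
# leave-one-out contrasts (1/(K-1)) * sum_{j != i} Z_j - X_i.
loo = (Z.sum(axis=1, keepdims=True) - Z) / (K - 1)
rhs = np.var(((loo - X) ** 2).mean(axis=1))

print(lhs, rhs)
```

In accordance with the lemma, the left-hand side should come out strictly smaller than the right-hand side.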
Proof.
By assumption, we have $\hat\theta_g - \hat\theta = \theta_g - \theta + O_P(1/\sqrt{n})$. If $\theta_g - \theta \neq 0$,
\[
\hat R_{\mathrm{mod}}(g) = (\theta_g - \theta)^2 + O_P(1/\sqrt{n}).
\]
On the other hand, if $\theta_g = \theta$,
\[
\hat R_{\mathrm{mod}}(g) = O_P(1/n).
\]
Thus,
\[
P[\theta_{\hat g} = \theta] \to 1.
\]
Now consider any $g$ with $\theta_g = \theta$ and $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$. Then,
\[
\hat R_{\mathrm{mod}}(g) \ge \mathrm{Var}(\hat\theta_g) + o_P(1/n),
\]
and
\[
\hat R_{\mathrm{mod}}(0) = \mathrm{Var}(\hat\theta) + o_P(1/n).
\]
By assumption, $\mathrm{Var}(\hat\theta_g) = \frac{1}{n} \mathrm{Var}(\psi_g) + o(1/n)$ and $\mathrm{Var}(\hat\theta) = \frac{1}{n} \mathrm{Var}(\psi) + o(1/n)$. Furthermore, by assumption $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$. Thus, for $n \to \infty$, $\lim_n n \hat R_{\mathrm{mod}}(0) < \lim_n n \hat R_{\mathrm{mod}}(g)$. As this holds for all $g$ with $\mathrm{Var}(\psi_g) > \mathrm{Var}(\psi)$, this concludes the proof.

References

Akaike, H. (1974). A new look at the statistical model identification. In
Selected Papers of Hirotugu Akaike, pages 215–222. Springer.

Angrist, J., Imbens, G., and Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.

Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys.

Athey, S., Chetty, R., and Imbens, G. (2020). Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676.

Athey, S., Chetty, R., Imbens, G. W., and Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences.

Bareinboim, E. and Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.

Bowden, R. and Turkington, D. (1990). Instrumental Variables, volume 8. Cambridge University Press.

Brookhart, M. and Van der Laan, M. (2006). A semiparametric model selection criterion with applications to the marginal structural model. Computational Statistics & Data Analysis, 50(2):475–498.

Claeskens, G. and Hjort, N. (2003). The focused information criterion. Journal of the American Statistical Association.

Claeskens, G. and Hjort, N. (2008). Model Selection and Model Averaging. Cambridge University Press.

Crump, R., Hotz, V., Imbens, G., and Mitnik, O. (2006). Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical report, National Bureau of Economic Research.

Cui, Y. and Tchetgen-Tchetgen, E. (2019). Bias-aware model selection for machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029.

Deng, A., Xu, Y., Kohavi, R., and Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 123–132.

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer.

Guo, F. and Perković, E. (2020). Efficient least squares for estimating total effects under linearity and causal sufficiency. arXiv preprint arXiv:2008.03481.

Heckman, J., Ichimura, H., and Todd, P. (1998). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65(2):261–294.

Hernán, M. and Robins, J. (2020). Causal Inference: What If. Chapman & Hall.

Imbens, G. and Wooldridge, J. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86.

Jakobsen, M. and Peters, J. (2020). Distributional robustness of k-class estimators and the pulse. arXiv preprint arXiv:2005.03353.

Kapelner, A., Bleich, J., Levine, A., Cohen, Z. D., DeRubeis, R., and Berk, R. (2014). Inference for the effectiveness of personalized medicine with software. arXiv preprint arXiv:1404.7844.

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1):295–318.

Nagar, A. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica, pages 575–595.

Nie, X. and Wager, S. (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.

Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 78(5).

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N., Hastie, T., and Tibshirani, R. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine.

Robins, J., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association.

Rolling, C. and Yang, Y. (2014). Model selection for estimating treatment effects. Journal of the Royal Statistical Society: Series B.

Rosenbaum, P. and Rubin, D. (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B, 45(2):212–218.

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., and Peters, J. (2020). Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229.

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Schuler, A., Baiocchi, M., Tibshirani, R., and Shah, N. (2018). A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2).

Splawa-Neyman, J., Dabrowska, D., and Speed, T. (1990). On the application of probability theory to agricultural experiments. Statistical Science, pages 465–472.

Taylor, J. and Tibshirani, R. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634.

Theil, H. (1961). Economic Forecasts and Policy. North-Holland Pub. Co.

Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.

Van der Laan, M. and Robins, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.

Van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

Wright, P. (1928). Tariff on Animal and Vegetable Oils. Macmillan Company, New York.

Zhao, Y., Fang, X., and Simchi-Levi, D. (2017). Uplift modeling with multiple treatments and general response types. In